Machine Learning: End-to-End guide for Java developers

Data Analysis, Machine Learning, and Neural Networks simplified

Product type: Course
Published: Oct 2017
Publisher: Packt
ISBN-13: 9781788622219
Length: 1159 pages
Edition: 1st Edition
Authors: Krishna Choppella, Uday Kamath

Chapter 2. Data Acquisition

It is never much fun to work with code that is not formatted properly or uses variable names that do not convey their intended purpose. The same can be said of data, except that bad data can also produce inaccurate results. Thus, data acquisition is an important step in the analysis of data. Data is available from a variety of sources, but it must be retrieved and ultimately processed before it can be useful. We can find it in numerous public data sources as simple files, or in more complex forms across the Internet. In this chapter, we will demonstrate how to acquire data from several of these, including various Internet sites and several social media sites.

We can access data from the Internet by downloading specific files or through a process known as web scraping, which involves extracting the contents of a web page. We also explore a related topic known as web crawling, which involves applications that examine a web site to determine whether it is of interest and then follow embedded links to identify other potentially relevant pages.

We can also extract data from social media sites. These types of sites often hold a treasure trove of data that is readily available if we know how to access it. In this chapter, we will demonstrate how to extract data from several sites, including:

  • Twitter
  • Wikipedia
  • Flickr
  • YouTube

When extracting data from a site, many different data formats may be encountered. We will examine three basic types: text, audio, and video. However, even within text, audio, and video data, many formats exist. For audio data alone, there are 45 audio coding formats compared at https://en.wikipedia.org/wiki/Comparison_of_audio_coding_formats. For textual data, there are almost 300 formats listed at http://fileinfo.com/filetypes/text. In this chapter, we will focus on how to download and extract these types of text as plain text for eventual processing.

We will briefly examine different data formats, followed by an examination of possible data sources. We need this knowledge to demonstrate how to obtain data using different data acquisition techniques.

Understanding the data formats used in data science applications

When we discuss data formats, we are referring to content format, as opposed to the underlying file format, which may not even be visible to most developers. Because of the vast number of formats available, we cannot examine them all. Instead, we will tackle several of the more common formats, providing adequate examples to address the most common data retrieval needs. Specifically, we will demonstrate how to retrieve data stored in the following formats:

  • HTML
  • PDF
  • CSV/TSV
  • Spreadsheets
  • Databases
  • JSON
  • XML

Some of these formats are well supported and documented elsewhere. For example, XML has been in use for years and there are several well-established techniques for accessing XML data in Java. For these types of data, we will outline the major techniques available and show a few examples to illustrate how they work. This will provide those readers who are not familiar with the technology some insight into their nature.

The most common data format is binary files. For example, Word, Excel, and PDF documents are all stored in binary. These require special software to extract information from them. Text data is also very common.

Overview of CSV data

Comma-Separated Values (CSV) files contain tabular data organized in a row-column format. The data is stored as plain text in rows, also called records. Each record contains fields separated by commas. These files are closely related to other delimited files, most notably Tab-Separated Values (TSV) files. The following is part of a simple CSV file; the numbers are not intended to represent any specific type of data:

JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE 
10001,44,22,0.5 
10002,35,19,0.54 
10003,1,1,1 

Notice that the first row contains header data to describe the subsequent records. Each value is separated by a comma and corresponds to the header in the same position. In Chapter 3, Data Cleaning, we will discuss CSV files in more depth and examine the support available for different types of delimiters.
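As a minimal sketch of how such records map onto fields, the following splits the sample above on commas and pairs each value with its header. The CsvSketch class and its parse method are hypothetical names introduced for illustration, and the code assumes simple data with no quoted fields or embedded commas; it is not a full CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: assumes plain comma-delimited text
// with no quoted fields or embedded commas.
final class CsvSketch {

    // Splits the text into records; each record is an array of trimmed fields.
    static List<String[]> parse(String csv) {
        List<String[]> records = new ArrayList<>();
        for (String line : csv.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",", -1);
            for (int i = 0; i < fields.length; i++) {
                fields[i] = fields[i].trim();
            }
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "JURISDICTION NAME,COUNT PARTICIPANTS,COUNT FEMALE,PERCENT FEMALE\n"
            + "10001,44,22,0.5\n"
            + "10002,35,19,0.54\n"
            + "10003,1,1,1\n";
        List<String[]> records = parse(csv);
        String[] header = records.get(0);
        // Pair every value with the header in the same position
        for (String[] record : records.subList(1, records.size())) {
            for (int i = 0; i < header.length; i++) {
                System.out.println(header[i] + " = " + record[i]);
            }
            System.out.println();
        }
    }
}
```

Splitting on a single delimiter is enough for data like this; real-world files with quoted values usually call for a dedicated CSV library.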

Overview of spreadsheets

Spreadsheets are a form of tabular data where information is stored in rows and columns, much like a two-dimensional array. They typically contain numeric and textual information and use formulas to summarize and analyze their contents. Most people are familiar with Excel spreadsheets, but they are also found as part of other product suites, such as OpenOffice.

Spreadsheets are an important data source because they have been used for the past several decades to store information in many industries and applications. Their tabular nature makes them easy to process and analyze. It is important to know how to extract data from this ubiquitous data source so that we can take advantage of the wealth of information that is stored in them.

For some of our examples, we will use a simple Excel spreadsheet that consists of a series of rows containing an ID, along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet looks like this:

ID      Minimum   Maximum   Average
12345   45        89        65.55
23456   78        96        86.75
34567   56        89        67.44
45678   86        99        95.67

In Chapter 3, Data Cleaning, we will learn how to extract data from spreadsheets.

Overview of databases

Data can be found in Database Management Systems (DBMS), which, like spreadsheets, are ubiquitous. Java provides a rich set of options for accessing and processing data in a DBMS. The intent of this section is to provide a basic introduction to database access using Java.

We will demonstrate the essence of connecting to a database, storing information, and retrieving information using JDBC. For this example, we used the MySQL DBMS. However, it will work for other DBMSes as well with a change in the database driver. We created a database called example and a table called URLTABLE using the following command within the MySQL Workbench. There are other tools that can achieve the same results:

CREATE TABLE IF NOT EXISTS `URLTABLE` ( 
  `RecordID` INT(11) NOT NULL AUTO_INCREMENT, 
  `URL` text NOT NULL, 
  PRIMARY KEY (`RecordID`) 
); 

We start with a try block to handle exceptions. A driver is needed to connect to the DBMS. In this example, we used com.mysql.jdbc.Driver. To connect to the database, the getConnection method is used, where the database server location, user ID, and password are passed. These values depend on the DBMS used:

    try { 
        Class.forName("com.mysql.jdbc.Driver"); 
        String url = "jdbc:mysql://localhost:3306/example"; 
        connection = DriverManager.getConnection(url, "user ID", 
            "password"); 
            ... 
    } catch (SQLException | ClassNotFoundException ex) { 
        // Handle exceptions 
    } 

Next, we will illustrate how to add information to the database and then how to read it. The SQL INSERT command is constructed in a string. The name of the MySQL database is example. This command will insert values into the URLTABLE table in the database where the question mark is a placeholder for the value to be inserted:

    String insertSQL = "INSERT INTO  `example`.`URLTABLE` " 
        + "(`url`) VALUES " + "(?);"; 

The PreparedStatement class represents an SQL statement to execute. The prepareStatement method creates an instance of the class using the INSERT SQL statement:

    PreparedStatement stmt = connection.prepareStatement(insertSQL); 

We then add URLs to the table using the setString method and the execute method. The setString method takes two arguments. The first is the 1-based index of the placeholder (the question mark) to fill in, and the second is the value to be inserted. The execute method does the actual insertion. We add two URLs in the next sequence:

    stmt.setString(1, "https://en.wikipedia.org/wiki/Data_science"); 
    stmt.execute(); 
    stmt.setString(1,  
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 
    stmt.execute(); 

To read the data, we use a SQL SELECT statement as declared in the selectSQL string. This will return all the rows and columns from the URLTABLE table. The createStatement method creates an instance of the Statement class, which is used to execute SQL statements that take no parameters, such as this SELECT. The executeQuery method executes the query and returns a ResultSet instance that holds the contents of the table:

    String selectSQL = "select * from URLTABLE"; 
    Statement statement = connection.createStatement(); 
    ResultSet resultSet = statement.executeQuery(selectSQL); 

The following sequence iterates through the table, displaying one row at a time. The argument of the getString method specifies that we want to use the second column of the result set, which corresponds to the URL field:

    out.println("List of URLs"); 
    while (resultSet.next()) { 
        out.println(resultSet.getString(2)); 
    }  

The output of this example, when executed, is as follows:

List of URLs
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly

If you need to empty the contents of the table, use the following sequence:

    Statement statement = connection.createStatement(); 
    statement.execute("TRUNCATE URLTABLE;"); 

This was a brief introduction to database access using Java. There are many resources available that will provide more in-depth coverage of this topic. For example, Oracle provides a more in-depth introduction to this topic at https://docs.oracle.com/javase/tutorial/jdbc/.

Overview of PDF files

The Portable Document Format (PDF) is a format not tied to a specific platform or software application. A PDF document can hold formatted text and images. PDF is an open standard, making it useful in a variety of places.

A large number of documents are stored as PDF, making it a valuable source of data. Several Java APIs allow access to PDF documents, such as Apache PDFBox. Techniques for extracting information from a PDF document are illustrated in Chapter 3, Data Cleaning.

Overview of JSON

JavaScript Object Notation (JSON) (http://www.JSON.org/) is a data format used to interchange data. It is easy for humans or machines to read and write. JSON is supported by many languages, including Java, which has several JSON libraries listed at http://www.JSON.org/.

A JSON entity is composed of a set of name-value pairs enclosed in curly braces. We will use this format in several of our examples. In handling YouTube, we will use a JSON object, part of which is shown next, representing one result of a YouTube search request:

{ 
  "kind": "youtube#searchResult", 
  "etag": "etag", 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  ... 
} 

Accessing the fields and values of such a document is not hard and is illustrated in Chapter 3, Data Cleaning.

Overview of XML

Extensible Markup Language (XML) is a markup language that specifies a standard document format. Widely used to communicate between applications and across the Internet, XML is popular due to its relative simplicity and flexibility. Documents encoded in XML are character-based and easily read by machines and humans.

XML documents contain markup and content characters. These characters allow parsers to classify the information contained within the document. The document consists of tags, and elements are stored within the tags. Elements may also contain other markup tags and form child elements. Additionally, elements may contain attributes or specific characteristics stored as a name-and-value pair.

An XML document must be well-formed. This means it must follow certain rules such as always using closing tags and only a single root tag. Other rules are discussed at https://en.wikipedia.org/wiki/XML#Well-formedness_and_error-handling.

The Java API for XML Processing (JAXP) consists of three interfaces for parsing XML data. The Document Object Model (DOM) interface parses an XML document and returns a tree structure delineating the structure of the document. The DOM interface parses an entire document as a whole. Alternatively, the Simple API for XML (SAX) parses a document one element at a time. SAX is preferable when memory usage is a concern as DOM requires more resources to construct the tree. DOM, however, offers flexibility over SAX in that any element can be accessed at any time and in any order.

The third Java API is known as Streaming API for XML (StAX). This streaming model was designed to accommodate the best parts of DOM and SAX models by granting flexibility without sacrificing resources. StAX exhibits higher performance, with the trade-off being that access is only available to one location in a document at a time. StAX is the preferred technique if you already know how you want to process the document, but it is also popular for applications with limited available memory.
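To make the streaming model concrete, the following is a minimal sketch that uses the StAX classes in javax.xml.stream to pull events from a document one at a time and collect the text of every element with a given name. The StaxSketch class and its elementTexts method are hypothetical names introduced for illustration, and the code assumes the matching elements contain only text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch only: reads the document as a forward-only stream of events.
final class StaxSketch {

    // Returns the text of every element named elementName, in document order.
    // Assumes those elements hold text only (no child elements).
    static List<String> elementTexts(String xml, String elementName)
            throws XMLStreamException {
        List<String> texts = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next(); // advance to the next event
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    texts.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return texts;
    }
}
```

Only the current event is held in memory, which is why the application cannot go back to an earlier part of the document without reading it again.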

The following is a simple XML file. Each <text> represents a tag, labelling the element contained within the tags. In this case, the largest node in our file is <music> and contained within it are sets of song data. Each tag within a <song> tag describes another element corresponding to that song. Every tag will eventually have a closing tag, such as </song>. Notice that the first tag contains information about which XML version should be used to parse the file:

<?xml version="1.0"?> 
<music> 
   <song id="1234"> 
      <artist>Patton, Courtney</artist> 
      <name>So This Is Life</name> 
      <genre>Country</genre> 
      <price>2.99</price> 
   </song> 
   <song id="5678"> 
      <artist>Eady, Jason</artist> 
      <name>AM Country Heaven</name> 
      <genre>Country</genre> 
      <price>2.99</price> 
   </song> 
</music> 
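As a rough sketch of the DOM approach applied to this file, the following loads the whole document into a tree and then reads each song's id attribute, artist, and name. The DomSketch class and its describeSongs method are hypothetical names introduced for illustration; the parser classes come from the standard javax.xml.parsers and org.w3c.dom packages.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch only: parses the entire document into a tree first.
final class DomSketch {

    // Returns one "id: artist - name" line per <song> element.
    static List<String> describeSongs(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> lines = new ArrayList<>();
        NodeList songs = doc.getElementsByTagName("song");
        for (int i = 0; i < songs.getLength(); i++) {
            Element song = (Element) songs.item(i);
            String artist = song.getElementsByTagName("artist").item(0).getTextContent();
            String name = song.getElementsByTagName("name").item(0).getTextContent();
            lines.add(song.getAttribute("id") + ": " + artist + " - " + name);
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<music>"
            + "<song id=\"1234\"><artist>Patton, Courtney</artist>"
            + "<name>So This Is Life</name><genre>Country</genre><price>2.99</price></song>"
            + "<song id=\"5678\"><artist>Eady, Jason</artist>"
            + "<name>AM Country Heaven</name><genre>Country</genre><price>2.99</price></song>"
            + "</music>";
        describeSongs(xml).forEach(System.out::println);
    }
}
```

Because the tree is fully built before any element is read, the songs could just as easily be visited in reverse or revisited later, at the cost of holding the whole document in memory.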

There are numerous other XML-related technologies. For example, we can validate a specific XML document using either a DTD document or an XML schema written specifically for that XML document. XML documents can be transformed into a different format using XSLT.

Overview of streaming data

Streaming data refers to data generated in a continuous stream and accessed in a sequential, piece-by-piece manner. Much of the data the average Internet user accesses is streamed, including video and audio channels, or text and image data on social media sites. Streaming data is the preferred method when the data is new and changing quickly, or when large data collections are sought.

Streamed data is often ideal for data science research because it generally exists in large quantities and raw format. Much public streaming data is available for free and supported by Java APIs. In this chapter, we are going to examine how to acquire data from streaming sources, including Twitter, Flickr, and YouTube. Despite the use of different techniques and APIs, you will notice similarities between the techniques used to pull data from these sites.

Overview of audio/video/images in Java

There are a large number of formats used to represent images, videos, and audio. This type of data is typically stored in binary format. Analog audio streams are sampled and digitized. Images are often simply collections of bits representing the color of each pixel.

Frequently, this type of data can be quite large and must be compressed. Two approaches to compression are used. The first is lossless compression, where less space is used and there is no loss of information. The second is lossy compression, where some information is lost. Losing information is not always a bad thing, as sometimes the loss is not noticeable to humans.
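The defining property of lossless compression is that decompressing returns exactly the original bytes. The following is a minimal sketch of that round trip using the GZIP streams in java.util.zip; the LosslessSketch class and its method names are hypothetical and introduced only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch only: a lossless compress/decompress round trip.
final class LosslessSketch {

    // Compresses the data with GZIP.
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(data);
        }
        return bytes.toByteArray();
    }

    // Restores the original bytes from GZIP-compressed data.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                bytes.write(chunk, 0, n);
            }
        }
        return bytes.toByteArray();
    }
}
```

Comparing the result of decompress(compress(data)) with data using Arrays.equals confirms that nothing was lost. A lossy codec, by contrast, would not pass that comparison.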

As we will demonstrate in Chapter 3, Data Cleaning, this type of data is often degraded in ways that make it hard to use, and it may need to be cleaned. For example, there may be background noise in an audio recording, or an image may need to be smoothed before it can be processed. Image smoothing is demonstrated in Chapter 3, Data Cleaning, using the OpenCV library.

Data acquisition techniques

In this section, we will illustrate how to acquire data from web pages. Web pages contain a potential bounty of useful information. We will demonstrate how to access web pages using several technologies, starting with a low-level approach supported by the HttpURLConnection class. To find pages, a web crawler application is often used. Once a useful page has been identified, information needs to be extracted from the page. This is often performed using an HTML parser. Extracting this information is important because it is often buried amid a clutter of HTML tags and JavaScript code.

Using the HttpURLConnection class

The contents of a web page can be accessed using the HttpURLConnection class. This is a low-level approach that requires the developer to do a lot of footwork to extract relevant content. However, the developer is able to exercise greater control over how the content is handled. In some situations, this approach may be preferable to using other API libraries.

We will demonstrate how to download the content of Wikipedia's data science page using this class. We start with a try/catch block to handle exceptions. A URL object is created using the data science URL string. The openConnection method will create a connection to the Wikipedia server as shown here:

    try { 
        URL url = new URL( 
            "https://en.wikipedia.org/wiki/Data_science"); 
        HttpURLConnection connection = (HttpURLConnection)  
            url.openConnection(); 
       ... 
    } catch (MalformedURLException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

The connection object is initialized with an HTTP GET command. The connect method is then executed to connect to the server:

    connection.setRequestMethod("GET"); 
    connection.connect(); 

Assuming no errors were encountered, we can determine whether the response was successful using the getResponseCode method. A normal return value is 200. The content of a web page can vary. For example, the getContentType method returns a string describing the page's content. The getContentLength method returns its length:

    out.println("Response Code: " + connection.getResponseCode()); 
    out.println("Content Type: " + connection.getContentType()); 
    out.println("Content Length: " + connection.getContentLength()); 

Assuming that we get an HTML-formatted page, the next sequence illustrates how to get this content. A BufferedReader instance is created, and one line at a time is read in from the web site and appended to a StringBuilder instance. The buffer is then displayed:

    InputStreamReader isr = new InputStreamReader((InputStream)  
        connection.getContent()); 
    BufferedReader br = new BufferedReader(isr); 
    StringBuilder buffer = new StringBuilder(); 
    String line; 
    // Stop at end of stream so that a trailing "null" is never appended 
    while ((line = br.readLine()) != null) { 
        buffer.append(line).append("\n"); 
    } 
    out.println(buffer.toString()); 

The abbreviated output is shown here:

Response Code: 200 
Content Type: text/html; charset=UTF-8 
Content Length: -1
<!DOCTYPE html> 
<html lang="en" dir="ltr" class="client-nojs"> 
<head> 
<meta charset="UTF-8"/>
<title>Data science - Wikipedia, the free encyclopedia</title> 
<script>document.documentElement.className =
...
"wgHostname":"mw1251"});});</script> 
</body>
</html>
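The fragments above can be combined into a single helper. The following is a sketch under a few assumptions: the PageFetchSketch class and its fetch and readBody methods are hypothetical names, the response is assumed to be UTF-8 text, and any response code other than 200 is treated as an error. It uses getInputStream rather than getContent and closes the reader with try-with-resources.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: downloads a page as text with HttpURLConnection.
final class PageFetchSketch {

    // Reads the stream line by line; assumes UTF-8 encoded text.
    static String readBody(InputStream in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                buffer.append(line).append('\n');
            }
        }
        return buffer.toString();
    }

    // Issues a GET request and returns the page content.
    static String fetch(String address) throws IOException {
        HttpURLConnection connection =
            (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestMethod("GET");
        try {
            int code = connection.getResponseCode();
            if (code != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected response code: " + code);
            }
            return readBody(connection.getInputStream());
        } finally {
            connection.disconnect();
        }
    }
}
```

A call such as PageFetchSketch.fetch("https://en.wikipedia.org/wiki/Data_science") would then return the page's HTML as a single string.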

While this is feasible, there are easier methods for getting the contents of a web page. One of these techniques is discussed in the next section.

Web crawlers in Java

Web crawling is the process of traversing a series of interconnected web pages and extracting relevant information from those pages. It does this by isolating and then following links on a page. While there are many precompiled datasets readily available, it may still be necessary to collect data directly off the Internet. Some sources such as news sites are continually being updated and need to be revisited from time to time.

A web crawler is an application that visits various sites and collects information. The web crawling process consists of a series of steps:

  1. Select a URL to visit
  2. Fetch the page
  3. Parse the page
  4. Extract relevant content
  5. Extract relevant URLs to visit

This process is repeated for each URL visited.
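The steps above can be sketched as a loop over a queue of URLs still to be visited. To keep the sketch self-contained, an in-memory map from URL to HTML stands in for the network, and links are found with a simple regular expression rather than an HTML parser. The CrawlLoopSketch class, its crawl method, and its parameters are hypothetical names introduced for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: the "web" is a map from URL to HTML text.
final class CrawlLoopSketch {

    // Simplified link pattern; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Returns the URLs of pages containing the topic, in the order found.
    static List<String> crawl(Map<String, String> web, String seed,
            String topic, int pageLimit) {
        List<String> relevant = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && relevant.size() < pageLimit) {
            String url = frontier.poll();          // 1. select a URL to visit
            if (!visited.add(url)) {
                continue;                          // already visited
            }
            String html = web.get(url);            // 2. fetch the page
            if (html == null) {
                continue;                          // broken link
            }
            Matcher links = HREF.matcher(html);    // 3. parse the page
            if (html.contains(topic)) {
                relevant.add(url);                 // 4. extract relevant content
            }
            while (links.find()) {
                frontier.add(links.group(1));      // 5. extract URLs to visit
            }
        }
        return relevant;
    }
}
```

Using a queue visits pages breadth-first and avoids deep recursion; swapping the map lookup for a real download turns the sketch into a basic crawler.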

There are several issues that need to be considered when fetching and parsing a page such as:

  • Page importance: We do not want to process irrelevant pages.
  • Exclusively HTML: We will not normally follow links to images, for example.
  • Spider traps: We want to bypass sites that may result in an infinite number of requests. This can occur with dynamically generated pages where one request leads to another.
  • Repetition: It is important to avoid crawling the same page more than once.
  • Politeness: Do not make an excessive number of requests to a website. Observe the robots.txt files; they specify which parts of a site should not be crawled.
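As one small illustration of the repetition issue, links that differ only in their fragment (the part after #) point to the same page. The following sketch removes the fragment before recording a URL as visited. The UrlKeySketch class and its methods are hypothetical names, and this handles only fragments; it is an assumption that this level of normalization is sufficient for a given site.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: treats URLs that differ only by #fragment as one page.
final class UrlKeySketch {

    // Returns the URL with any #fragment removed.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Returns true the first time a page is seen and false on repeats.
    static boolean markVisited(Set<String> visited, String url) {
        return visited.add(withoutFragment(url));
    }

    public static void main(String[] args) {
        Set<String> visited = new HashSet<>();
        System.out.println(markVisited(visited, "https://example.org/page"));
        System.out.println(markVisited(visited, "https://example.org/page#History"));
    }
}
```

A set is used here rather than a list so that the membership check stays fast as the number of visited pages grows.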

The process of creating a web crawler can be daunting. For all but the simplest needs, it is recommended that one of several open source web crawlers be used.

We can either create our own web crawler or use an existing crawler and in this chapter we will examine both approaches. For specialized processing, it can be desirable to use a custom crawler. We will demonstrate how to create a simple web crawler in Java to provide more insight into how web crawlers work. This will be followed by a brief discussion of other web crawlers.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

Variable      Use
topic         The keyword that needs to be in a page for the page to be accepted
startingURL   The URL of the first page
urlLimiter    A string that must be contained in a link before it will be followed
pageLimit     The maximum number of pages to retrieve
visitedList   The ArrayList containing pages that have already been visited
pageList      An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, a small rock in the Isles of Scilly, off the southwestern coast of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which will restrict the embedded links to follow to just those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,_" 
            + "Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If the limit has been reached, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 

Jsoup is used to parse the page and return a Document object. Many different exceptions and problems can occur, such as a malformed URL, retrieval timeouts, or simply bad links. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in the Web scraping in Java section:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements questions = doc.select("a[href]"); 
        for (Element link : questions) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl to an external store. Normally this is necessary, and the results can be stored in a file or database.
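As a minimal sketch of the file option, the following writes a list of URLs such as pageList to a text file, one URL per line, and reads it back. The CrawlResultsSketch class and its save and load methods are hypothetical names introduced for illustration; the file location is left to the caller.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative sketch only: persists crawl results as one URL per line.
final class CrawlResultsSketch {

    // Writes the URLs to the file, replacing any existing content.
    static void save(List<String> pageList, Path file) throws IOException {
        Files.write(file, pageList, StandardCharsets.UTF_8);
    }

    // Reads the URLs back in the order they were written.
    static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }
}
```

For example, calling CrawlResultsSketch.save(pageList, Paths.get("crawl-results.txt")) at the end of the crawl would keep the list for a later run; the file name here is only a placeholder.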

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler, while the latter contains the logic that controls which pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, a CrawlController instance is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes, which are used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They help a crawler do a better job, for example by specifying which parts of a site should not be crawled. This is useful for automatically generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest: the shouldVisit method, which determines whether a page will be visited, and the visit method, which actually handles the page. We start with the class declaration and the declaration of a Pattern object, an instance of Java's regular expression class. It provides one way of determining whether a page will be visited. In this declaration, common image file extensions are specified, and URLs ending in them will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found along with the URL. If the URL ends with one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }
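To see how these two checks behave without running a crawl, the same logic can be exercised on plain strings. This is only a sketch: the VisitFilterSketch class is a hypothetical stand-in that takes a String instead of crawler4j's Page and WebURL objects, and it is not part of the crawler4j API.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the shouldVisit checks applied to plain strings.
final class VisitFilterSketch {

    private static final Pattern IMAGE_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpg|png)$");

    // Returns true only for non-image URLs under en.wikipedia.org/wiki/.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false; // image link
        }
        return href.startsWith("https://en.wikipedia.org/wiki/");
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit(
            "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"));
        System.out.println(shouldVisit(
            "https://en.wikipedia.org/wiki/File:Bishop_Rock.JPG"));
        System.out.println(shouldVisit(
            "https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231"));
    }
}
```

The first URL passes both checks. The second is rejected because the lowercased URL ends in .jpg, and the third because revision links under /w/index.php do not start with the required /wiki/ prefix.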

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, Text, and Text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL, the text, and the length of the text. The following is a sample of other data that you may be interested in obtaining:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src=\"eyechart.jpg\" alt=\"Eye Chart\"> \n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.

We can use the select method to retrieve the images in a document, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>
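The src values in this output are protocol-relative (they begin with //), so they need to be resolved against the page's URL before the images can be downloaded. The following is a minimal sketch using the JDK's java.net.URI class. The LinkResolverSketch class is hypothetical and the image path is a made-up placeholder:

```java
import java.net.URI;

// Hypothetical helper that turns a link found on a page into an absolute URL.
final class LinkResolverSketch {

    // Resolves a relative, root-relative, or protocol-relative link against
    // the URL of the page it was found on. URI.create throws an
    // IllegalArgumentException if either string is not a valid URI.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "https://en.wikipedia.org/wiki/Data_science";
        // Placeholder path; the real paths are elided in the output above
        System.out.println(resolve(page, "//upload.wikimedia.org/example.png"));
        System.out.println(resolve(page, "/wiki/Jsoup"));
    }
}
```

A protocol-relative link inherits only the scheme of the page, while a root-relative link such as /wiki/Jsoup inherits both the scheme and the host.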

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media sites contain a wealth of information that can be processed and used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of OAuth, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.
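One simple way to stay under a request limit is to space requests out on the client side. The following is a minimal sketch, not tied to any particular site's policy. The RequestThrottleSketch class is hypothetical, and the interval is an assumption that would need to be taken from the site's published limits:

```java
// Hypothetical client-side throttle that enforces a minimum gap between requests.
final class RequestThrottleSketch {
    private final long minIntervalMillis;
    private long nextAllowedMillis = 0L;

    RequestThrottleSketch(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Reserves the next request slot and returns how many milliseconds the
    // caller should wait before sending the request (0 if it can go now).
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedMillis);
        nextAllowedMillis = start + minIntervalMillis;
        return start - nowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed limit for illustration: at most one request every 200 ms
        RequestThrottleSketch throttle = new RequestThrottleSketch(200);
        for (int i = 0; i < 3; i++) {
            long wait = throttle.reserve(System.currentTimeMillis());
            Thread.sleep(wait);
            System.out.println("Request " + i + " sent after waiting " + wait + " ms");
        }
    }
}
```

Passing the current time in as a parameter keeps the class easy to test. A production client would typically also react to the rate limit responses that the site itself returns.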

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backward compatible with it. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
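The polling pattern used in this loop can be exercised without a Twitter connection. The following is a minimal sketch using only the JDK's java.util.concurrent classes. The QueuePollingSketch class, its drain method, and the sample messages are hypothetical, and the producer thread stands in for the streaming client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the read loop: a producer fills the queue and
// the consumer polls it with a timeout.
final class QueuePollingSketch {

    // Reads up to maxMessages from the queue and stops early if no message
    // arrives within timeoutMillis or if the thread is interrupted.
    static List<String> drain(BlockingQueue<String> queue,
            int maxMessages, long timeoutMillis) {
        List<String> received = new ArrayList<>();
        try {
            while (received.size() < maxMessages) {
                String msg = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break; // Timed out waiting for the next message
                }
                received.add(msg);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // Preserve the interrupt status
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> statusQueue = new LinkedBlockingQueue<>(10000);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                statusQueue.offer("{\"id\":" + i + "}"); // Made-up messages
            }
        });
        producer.start();
        producer.join();
        for (String msg : drain(statusQueue, 1000, 100)) {
            System.out.println(msg);
        }
    }
}
```

Unlike the loop above, which keeps waiting after a timeout, this sketch stops at the first timeout so that the example terminates quickly.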

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
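Because the four bounding box arguments are easy to transpose, it can help to validate them before passing them on. The following is a minimal sketch using only the JDK. The BoundingBoxSketch class and its toArgs method are hypothetical, the coordinates in main are illustrative values, and the sketch assumes that each minimum must not exceed its maximum:

```java
import java.util.Arrays;

// Hypothetical helper that checks a bounding box and returns its values in
// the order described above: min longitude, min latitude, max longitude,
// max latitude.
final class BoundingBoxSketch {

    static String[] toArgs(double minLon, double minLat,
            double maxLon, double maxLat) {
        if (minLon < -180 || maxLon > 180 || minLat < -90 || maxLat > 90) {
            throw new IllegalArgumentException("Coordinate out of range");
        }
        // Simplifying assumption: boxes are not allowed to wrap around
        if (minLon > maxLon || minLat > maxLat) {
            throw new IllegalArgumentException("Minimum exceeds maximum");
        }
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)
        };
    }

    public static void main(String[] args) {
        // Illustrative coordinates for a small search area
        String[] box = toArgs(-6.5, 49.8, -6.2, 50.0);
        System.out.println(Arrays.toString(box));
    }
}
```

The four strings could then be passed to the setBBox method in the same order, for example setBBox(box[0], box[1], box[2], box[3]).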

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The search method's first parameter is the SearchParameters instance. The second parameter determines how many photos are retrieved per page and the third parameter selects which page of results to return. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
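The ImageIO.write method returns false when no suitable writer is found for the requested format, so it is worth checking its result. The following is a minimal sketch using only the JDK. The ImageSaveSketch class and its trySave method are hypothetical, and a blank in-memory image stands in for a downloaded photo:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper that saves an image and reports whether it succeeded.
final class ImageSaveSketch {

    // Returns true if the image was written, false if no writer supports
    // the format or an I/O error occurred.
    static boolean trySave(BufferedImage image, String format, File target) {
        try {
            return ImageIO.write(image, format, target);
        } catch (IOException ex) {
            System.err.println("Could not write " + target + ": " + ex);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // A small blank image stands in for one retrieved from Flickr
        BufferedImage image =
            new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        File outputfile = File.createTempFile("image", ".jpg");
        System.out.println("Saved: " + trySave(image, "jpg", outputfile));
        System.out.println("Bytes written: " + outputfile.length());
    }
}
```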

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Using the HttpURLConnection class

The contents of a web page can be accessed using the HttpURLConnection class. This is a low-level approach that requires the developer to do a lot of legwork to extract relevant content. However, the developer is able to exercise greater control over how the content is handled. In some situations, this approach may be preferable to using other API libraries.

We will demonstrate how to download the content of Wikipedia's data science page using this class. We start with a try/catch block to handle exceptions. A URL object is created using the data science URL string. The openConnection method will create a connection to the Wikipedia server as shown here:

    try { 
        URL url = new URL( 
            "https://en.wikipedia.org/wiki/Data_science"); 
        HttpURLConnection connection = (HttpURLConnection)  
            url.openConnection(); 
       ... 
    } catch (MalformedURLException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

The connection object is initialized with an HTTP GET command. The connect method is then executed to connect to the server:

    connection.setRequestMethod("GET"); 
    connection.connect(); 

Assuming no errors were encountered, we can determine whether the response was successful using the getResponseCode method. A normal return value is 200. The content of a web page can vary. For example, the getContentType method returns a string describing the page's content. The getContentLength method returns its length:

    out.println("Response Code: " + connection.getResponseCode()); 
    out.println("Content Type: " + connection.getContentType()); 
    out.println("Content Length: " + connection.getContentLength()); 
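The string returned by getContentType often carries the character encoding, as in text/html; charset=UTF-8, and a content length of -1 means that the length is not known in advance. The following is a minimal sketch of how the encoding might be pulled out of such a string so that the page can be decoded correctly. The ContentTypeSketch class and its charsetOf method are hypothetical names introduced only for this illustration:

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeSketch {
    // Returns the charset named in a Content-Type value, or the supplied
    // fallback when no charset is given or the name is not recognized.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length())
                    .replace("\"", "").trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException ex) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

A call such as ContentTypeSketch.charsetOf(connection.getContentType(), StandardCharsets.UTF_8) could then supply the charset used when the response is read.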

Assuming that we get an HTML formatted page, the next sequence illustrates how to get this content. A BufferedReader instance is created where one line at a time is read in from the web site and appended to a StringBuilder instance. The buffer is then displayed:

    InputStreamReader isr = new InputStreamReader((InputStream)  
        connection.getContent()); 
    BufferedReader br = new BufferedReader(isr); 
    StringBuilder buffer = new StringBuilder(); 
    String line; 
    // Stop at end of stream so that a trailing "null" is not appended 
    while ((line = br.readLine()) != null) { 
        buffer.append(line).append("\n"); 
    } 
    out.println(buffer.toString()); 

The abbreviated output is shown here:

Response Code: 200 
Content Type: text/html; charset=UTF-8 
Content Length: -1
<!DOCTYPE html> 
<html lang="en" dir="ltr" class="client-nojs"> 
<head> 
<meta charset="UTF-8"/>
<title>Data science - Wikipedia, the free encyclopedia</title> 
<script>document.documentElement.className =
...
"wgHostname":"mw1251"});});</script> 
</body>
</html>
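The read loop used above can be packaged as a small helper that closes the stream when it is done and decodes the bytes with an explicit character set. This is a sketch rather than the book's code. The StreamTextSketch class and readAll method are names assumed for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class StreamTextSketch {
    // Reads the whole stream as text, one line at a time, and closes it.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                buffer.append(line).append('\n');
            }
        } catch (IOException ex) {
            // Rethrown unchecked to keep the sketch's signature simple
            throw new UncheckedIOException(ex);
        }
        return buffer.toString();
    }
}
```

With an open connection, the helper could be called as StreamTextSketch.readAll(connection.getInputStream(), StandardCharsets.UTF_8).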

While this is feasible, there are easier methods for getting the contents of a web page. One of these techniques is discussed in the next section.

Web crawlers in Java

Web crawling is the process of traversing a series of interconnected web pages and extracting relevant information from those pages. It does this by isolating and then following links on a page. While there are many precompiled datasets readily available, it may still be necessary to collect data directly off the Internet. Some sources such as news sites are continually being updated and need to be revisited from time to time.

A web crawler is an application that visits various sites and collects information. The web crawling process consists of a series of steps:

  1. Select a URL to visit
  2. Fetch the page
  3. Parse the page
  4. Extract relevant content
  5. Extract relevant URLs to visit

This process is repeated for each URL visited.
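The steps above can be sketched as a loop over a queue of URLs still to be visited. To keep the sketch runnable without network access, the web is represented here by a hypothetical in-memory map from a page's URL to the links found on that page. A real crawler would fetch and parse each page instead. CrawlLoopSketch and its crawl method are illustrative names:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlLoopSketch {
    // 'web' stands in for the network: page URL -> links on that page.
    static List<String> crawl(String seed,
            Map<String, List<String>> web, int pageLimit) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < pageLimit) {
            String url = frontier.remove();     // 1. Select a URL
            List<String> links = web.get(url);  // 2-3. Fetch and parse
            if (links == null) {
                continue;                       // Page could not be fetched
            }
            visited.add(url);                   // 4. Extract content (recorded only)
            for (String link : links) {         // 5. Extract URLs to visit
                if (seen.add(link)) {           // Queue each URL at most once
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Because the queue is processed first in, first out, pages are visited breadth-first. The crawler built later in this chapter recurses instead, which visits pages depth-first.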

There are several issues that need to be considered when fetching and parsing a page such as:

  • Page importance: We do not want to process irrelevant pages.
  • Exclusively HTML: We will not normally follow links to images, for example.
  • Spider traps: We want to bypass sites that may result in an infinite number of requests. This can occur with dynamically generated pages where one request leads to another.
  • Repetition: It is important to avoid crawling the same page more than once.
  • Politeness: Do not make an excessive number of requests to a website. Observe the robots.txt files; they specify which parts of a site should not be crawled.
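To make the politeness point concrete, the following sketch checks a path against the Disallow rules that a robots.txt file lists for all crawlers. It is deliberately simplified: it assumes one User-agent line per group and ignores Allow rules and wildcard patterns, so it should not be taken as a complete implementation of the robots.txt conventions. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsTxtSketch {
    // Collects the Disallow path prefixes listed under "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // Drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim()
                .toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equals("disallow") && inStarGroup
                    && !value.isEmpty()) {
                rules.add(value);
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with a disallowed prefix.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

In practice, a maintained crawler library is a safer choice for this job, as the crawler4j example later in this chapter shows.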

The process of creating a web crawler can be daunting. For all but the simplest needs, it is recommended that one of several open source web crawlers be used.

We can either create our own web crawler or use an existing crawler and in this chapter we will examine both approaches. For specialized processing, it can be desirable to use a custom crawler. We will demonstrate how to create a simple web crawler in Java to provide more insight into how web crawlers work. This will be followed by a brief discussion of other web crawlers.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

  • topic: The keyword that needs to be in a page for the page to be accepted
  • startingURL: The URL of the first page
  • urlLimiter: A string that must be contained in a link before it will be followed
  • pageLimit: The maximum number of pages to retrieve
  • visitedList: The ArrayList containing pages that have already been visited
  • pageList: An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, a small rock in the Isles of Scilly, off the southwestern coast of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which will restrict the embedded links to follow to just those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,_" 
            + "Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If the limit has been reached, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 

Jsoup is used to parse the page and return a Document object. There are many different exceptions and problems that can occur such as a malformed URL, retrieval timeouts, or simply bad links. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in the Web scraping in Java section:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements links = doc.select("a[href]"); 
        for (Element link : links) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]
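Entry 3 in this output ends in #Lighthouse. A fragment such as this names a position within a page rather than a different page, so one inexpensive refinement is to strip fragments before checking whether a URL has been visited. The sketch below shows one way this might be done. It also uses a set rather than an ArrayList, which makes the lookup faster as the list of URLs grows. The class and method names are assumptions made for illustration:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UrlDedupSketch {
    // Removes the "#fragment" part of a URL, if there is one.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Returns the distinct pages in first-seen order.
    static Set<String> distinctPages(List<String> urls) {
        Set<String> pages = new LinkedHashSet<>();
        for (String url : urls) {
            pages.add(stripFragment(url));
        }
        return pages;
    }
}
```

This does not merge the many oldid and diff URLs in the output, which are genuinely different addresses. Those are better excluded with a stricter limiting rule.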

In this example, we did not save the results of the crawl to an external store. Normally this is necessary, and the results can be saved in a file or database.

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler while the latter contains the logic that controls what pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. This crawler will return fewer results because many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They provide direction to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest. The first is the shouldVisit method, which determines whether a page will be visited. The second is the visit method, which actually handles the page. We start with the class declaration and the declaration of a Pattern object from Java's regular expression package. It will be one way of determining whether a page will be visited. In this declaration, common image file extensions are specified, and URLs ending in them will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found along with the URL. If the URL ends with one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }
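The filtering logic can be tried out on its own, without crawler4j, by applying the same pattern and prefix test to plain strings. The sketch below does this. UrlFilterSketch is a hypothetical class, and its shouldVisit method takes a String rather than the WebURL object that crawler4j passes:

```java
import java.util.Locale;
import java.util.regex.Pattern;

class UrlFilterSketch {
    private static final Pattern IMAGE_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpg|png)$");
    private static final String PREFIX =
        "https://en.wikipedia.org/wiki/";

    // Same decision as the shouldVisit method above, for a plain string.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false; // Looks like an image file
        }
        return href.startsWith(PREFIX);
    }
}
```

The prefix test rejects addresses under https://en.wikipedia.org/w/, which is where the revision and diff pages reported by our earlier crawler live.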

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, Text, and Text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL, the text, and the length of the text. The following is a sample of other data that you may be interested in obtaining:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src=\"eyechart.jpg\" alt=\"Eye Chart\"> \n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.

We can use the select method to retrieve the PNG images in a document, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media contain a wealth of information that can be processed and is used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of OAuth, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If you need to exceed these limits, most sites provide a way of arranging this.
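One simple way to stay within a request limit is to enforce a minimum interval between requests on the client side. The following sketch computes how long a caller should wait before sending its next request. The current time is passed in so that the logic can be tested without sleeping. RequestThrottleSketch is a hypothetical class and is not part of any of the APIs discussed here:

```java
class RequestThrottleSketch {
    private final long minIntervalMillis;
    private long lastRequestMillis;
    private boolean first = true;

    RequestThrottleSketch(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Returns the number of milliseconds to wait before a request made
    // at 'nowMillis' may be sent, and records when it will be sent.
    long reserve(long nowMillis) {
        if (first) {
            first = false;
            lastRequestMillis = nowMillis;
            return 0;
        }
        long earliest = lastRequestMillis + minIntervalMillis;
        long wait = Math.max(0, earliest - nowMillis);
        lastRequestMillis = nowMillis + wait;
        return wait;
    }
}
```

A caller would typically write Thread.sleep(throttle.reserve(System.currentTimeMillis())) before each request. The actual limits vary by site and should be taken from each site's documentation.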

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backwards compatible with it. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using HBC's OAuth1 class, which implements the Authentication interface. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
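The timed poll call is the heart of this loop and can be exercised without a Twitter connection by filling a queue by hand. The sketch below uses only the JDK. Unlike the loop above, which keeps waiting after a timeout, it stops at the first poll that times out. QueueDrainSketch and drain are names assumed for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

class QueueDrainSketch {
    // Reads up to maxMessages from the queue, giving up as soon as
    // one poll times out without receiving a message.
    static List<String> drain(BlockingQueue<String> queue,
            int maxMessages, long timeoutMillis) {
        List<String> received = new ArrayList<>();
        try {
            while (received.size() < maxMessages) {
                String msg = queue.poll(timeoutMillis,
                    TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break; // Timed out with nothing to read
                }
                received.add(msg);
            }
        } catch (InterruptedException ex) {
            // Restore the interrupt flag and return what we have
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```

Separating the consumer in this way makes it easier to test how messages are handled before connecting to a live stream.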

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 
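One common way to keep keys out of source code is to read them from environment variables when the program starts. The following sketch fails with a clear message when a required setting is missing. The CredentialsSketch class and the variable name TWITTER_CONSUMER_KEY are hypothetical choices made for this illustration:

```java
import java.util.Map;

class CredentialsSketch {
    // Returns the named setting or fails fast if it is absent or blank.
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException(
                "Missing required setting: " + name);
        }
        return value;
    }
}
```

In an application, the map would normally be System.getenv(), for example CredentialsSketch.require(System.getenv(), "TWITTER_CONSUMER_KEY"). Passing the map as a parameter keeps the method easy to test.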

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.
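Because the API works over HTTP, a request is ultimately just a URL with query parameters. The sketch below builds a URL that asks for the wikitext of a page. The parameter names (action, prop, rvprop, format, and titles) are given as we understand them from the MediaWiki API documentation and should be checked against the current version. WikiQuerySketch is a hypothetical helper class:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class WikiQuerySketch {
    // Builds a query URL requesting the content of one page as JSON.
    static String contentQueryUrl(String apiBase, String title) {
        try {
            return apiBase
                + "?action=query&prop=revisions&rvprop=content"
                + "&format=json&titles="
                + URLEncoder.encode(title, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every JVM
            throw new IllegalStateException(ex);
        }
    }
}
```

The resulting URL could be fetched with the HttpURLConnection technique shown earlier in this chapter. A client library removes the need to build requests and parse responses by hand.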

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
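
The order of the setBBox arguments is easy to get wrong, so it can help to validate the values before they are passed on. The following is a minimal sketch that uses only the JDK. The BoundingBox class and its toArguments method are hypothetical names introduced for illustration and are not part of Flickr4Java:

```java
import java.util.Locale;

// Hypothetical helper, not part of Flickr4Java: validates a bounding box and
// returns its corners in the order described above
// (minimum longitude, minimum latitude, maximum longitude, maximum latitude).
final class BoundingBox {
    private BoundingBox() { }

    static String[] toArguments(double minLon, double minLat,
                                double maxLon, double maxLat) {
        // The negated form also rejects NaN values
        if (!(minLon >= -180 && maxLon <= 180 && minLon <= maxLon)) {
            throw new IllegalArgumentException("Invalid longitude range");
        }
        if (!(minLat >= -90 && maxLat <= 90 && minLat <= maxLat)) {
            throw new IllegalArgumentException("Invalid latitude range");
        }
        return new String[] {
            format(minLon), format(minLat), format(maxLon), format(maxLat)
        };
    }

    private static String format(double value) {
        // Locale.ROOT keeps the decimal separator a period on every system
        return String.format(Locale.ROOT, "%.4f", value);
    }
}
```

The four strings returned by toArguments could then be passed to setBBox in the same order.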

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first argument of the search method. The second argument determines how many photos are retrieved per page and the third is the page of results to retrieve. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
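
ImageIO.write reports a missing writer by returning false rather than by throwing an exception, which is easy to overlook. The following minimal sketch wraps the call so that this case is not silently ignored. The ImageSaver class is a hypothetical helper, and converting IOException to UncheckedIOException is simply one design choice among several:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: saves an image and fails loudly when nothing was written.
final class ImageSaver {
    private ImageSaver() { }

    static void save(BufferedImage image, String format, File target) {
        try {
            // ImageIO.write returns false when no registered writer
            // supports the requested format for this image
            if (!ImageIO.write(image, format, target)) {
                throw new IllegalStateException(
                    "No ImageIO writer for format: " + format);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

With this helper, the previous sequence could be written as ImageSaver.save(bufferedImage, "jpg", outputfile).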

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search itself. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: No initialization is needed for this example, so an empty implementation is supplied

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents the search request. The YouTube class's search method provides access to the search operations, and the list method creates the request. Its string argument specifies which parts of each search result are to be returned, in this case the id and snippet parts:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

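One possible output follows, where parts of the output have been modified. Kind is displayed as null because the fields filter set earlier requested id/kind but not the top-level kind field: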

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 
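
The video ID returned by the earlier search can be turned into such a URL programmatically. The following is a minimal sketch that uses only the JDK (Java 10 or later for this URLEncoder overload). The WatchUrl class is a hypothetical helper, and the use of https is our assumption:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a watch URL from a video ID.
final class WatchUrl {
    private WatchUrl() { }

    static String forVideo(String videoId) {
        if (videoId == null || videoId.isEmpty()) {
            throw new IllegalArgumentException("A video ID is required");
        }
        // Encoding guards against characters that are not legal in a query value
        return "https://www.youtube.com/watch?v="
            + URLEncoder.encode(videoId, StandardCharsets.UTF_8);
    }
}
```

For example, WatchUrl.forVideo(video.getId().getVideoId()) would build the URL for the first search result.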

There are other more sophisticated download techniques found at the GitHub site.

Web crawlers in Java

Web crawling is the process of traversing a series of interconnected web pages and extracting relevant information from those pages. It does this by isolating and then following links on a page. While there are many precompiled datasets readily available, it may still be necessary to collect data directly off the Internet. Some sources such as news sites are continually being updated and need to be revisited from time to time.

A web crawler is an application that visits various sites and collects information. The web crawling process consists of a series of steps:

  1. Select a URL to visit
  2. Fetch the page
  3. Parse the page
  4. Extract relevant content
  5. Extract relevant URLs to visit

This process is repeated for each URL visited.
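
The steps above can be sketched as a loop over a queue of URLs waiting to be visited. This is only an illustration of the control flow under simplifying assumptions: the CrawlLoop class is hypothetical, and fetching, parsing, and link extraction are hidden behind a function argument so that the loop can be studied without a network connection:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical skeleton of the crawl loop.
final class CrawlLoop {
    private CrawlLoop() { }

    static List<String> crawl(String seed, int pageLimit,
            Function<String, List<String>> linkExtractor) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be visited
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        List<String> visited = new ArrayList<>();    // URLs in visit order

        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < pageLimit) {
            String url = frontier.poll();                  // select a URL
            List<String> links = linkExtractor.apply(url); // fetch, parse, extract URLs
            visited.add(url);                              // content would be stored here
            for (String link : links) {
                if (seen.add(link)) {                      // add returns false for repeats
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Passing a function that looks up links in an in-memory map is a convenient way to experiment with the loop before connecting it to a real page fetcher.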

There are several issues that need to be considered when fetching and parsing a page such as:

  • Page importance: We do not want to process irrelevant pages.
  • Exclusively HTML: We will not normally follow links to images, for example.
  • Spider traps: We want to bypass sites that may result in an infinite number of requests. This can occur with dynamically generated pages where one request leads to another.
  • Repetition: It is important to avoid crawling the same page more than once.
  • Politeness: Do not make an excessive number of requests to a website. Observe the robots.txt files; they specify which parts of a site should not be crawled.

The process of creating a web crawler can be daunting. For all but the simplest needs, it is recommended that one of the several available open source web crawlers be used, such as crawler4j, which is demonstrated later in this chapter.

We can either create our own web crawler or use an existing crawler and in this chapter we will examine both approaches. For specialized processing, it can be desirable to use a custom crawler. We will demonstrate how to create a simple web crawler in Java to provide more insight into how web crawlers work. This will be followed by a brief discussion of other web crawlers.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

  • topic: The keyword that needs to be in a page for the page to be accepted
  • startingURL: The URL of the first page
  • urlLimiter: A string that must be contained in a link before it will be followed
  • pageLimit: The maximum number of pages to retrieve
  • visitedList: The ArrayList containing pages that have already been visited
  • pageList: An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, a small rock in the Isles of Scilly, off the southwestern tip of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which will restrict the embedded links to follow to just those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,_" 
            + "Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If it has, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 
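
The visited check compares URL strings exactly, so two links that differ only in a fragment, such as a trailing #Lighthouse, are treated as different pages. One way to reduce this is to normalize each URL before the check. The following is a minimal sketch using only java.net.URI. The UrlNormalizer class is hypothetical, and it deliberately handles only fragments and "." or ".." path segments:

```java
import java.net.URI;

// Hypothetical helper: reduces trivially different URLs to one form.
final class UrlNormalizer {
    private UrlNormalizer() { }

    static String normalize(String url) {
        // A fragment identifies a position within a page, not another page
        int hash = url.indexOf('#');
        String withoutFragment = hash >= 0 ? url.substring(0, hash) : url;
        // URI.normalize removes "." segments and resolves ".." segments;
        // URI.create throws IllegalArgumentException for malformed input
        return URI.create(withoutFragment).normalize().toString();
    }
}
```

Calling visitedList.contains(UrlNormalizer.normalize(url)), and adding the normalized form to the list, would then treat such variants as a single page.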

Jsoup is used to parse the page and return a Document object. Many different exceptions and problems can occur, such as a malformed URL, a retrieval timeout, or simply a bad link. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in the Web scraping in Java section:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements questions = doc.select("a[href]"); 
        for (Element link : questions) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl in an external source. Normally this is necessary, and the results can be stored in a file or database.
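
As a minimal sketch of the file option, the accepted URLs could be written one per line using java.nio.file.Files. The CrawlResultWriter class and the choice of a plain text file are our assumptions and are not part of the crawler shown above:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: stores the crawl results as one URL per line.
final class CrawlResultWriter {
    private CrawlResultWriter() { }

    static void write(List<String> pageList, Path target) {
        try {
            // Creates the file, or replaces its contents if it already exists
            Files.write(target, pageList, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

A call such as CrawlResultWriter.write(pageList, Paths.get("pages.txt")) at the end of the constructor would save the list, where pages.txt is an arbitrary file name.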

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler while the latter contains the logic that controls what pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. This crawler will return fewer results because many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, a CrawlController instance is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes, which are used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They help a crawler do a better job, for example by specifying which parts of a site should not be crawled. This is useful for automatically generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 
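
To give a feel for what such a file contains, the following sketch parses a robots.txt text and answers whether a path may be crawled. It is a deliberately simplified illustration and not how crawler4j implements this: the SimpleRobotsRules class is hypothetical, it reads only the Disallow lines that follow a "User-agent: *" line, and it ignores Allow rules, wildcards, and consecutive User-agent lines that share a group:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt reader.
final class SimpleRobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    SimpleRobotsRules(String robotsTxt) {
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine;
            int comment = line.indexOf('#');          // strip trailing comments
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                             // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup
                    && !value.isEmpty()) {            // empty Disallow allows everything
                disallowed.add(value);
            }
        }
    }

    // A path is allowed unless it starts with one of the disallowed prefixes
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

In practice, the RobotstxtServer class takes care of this for us, so a hand-written parser like this one is only useful for understanding the idea.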

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest. The first is the shouldVisit method, which determines whether a page will be visited, and the second is the visit method, which actually handles the page. We start with the class declaration and the declaration of a java.util.regex.Pattern object. It will be one way of determining whether a page will be visited. In this declaration, standard image extensions are specified; URLs ending with them will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found along with the URL. If the URL ends with one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }
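
The filtering logic can be tried out without crawler4j by applying the same two tests to plain strings. The VisitFilter class below is a hypothetical stand-in that mirrors the shouldVisit method shown above, taking the URL as a String instead of a WebURL:

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the shouldVisit logic.
final class VisitFilter {
    private static final Pattern IMAGE_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpg|png)$");

    private VisitFilter() { }

    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false;                 // skip links that end in an image extension
        }
        return href.startsWith("https://en.wikipedia.org/wiki/");
    }
}
```

Note that revision links of the form https://en.wikipedia.org/w/index.php?... do not start with the required prefix, so this filter rejects the older versions of a page that our simple crawler collected.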

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, Text, and Text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL, the text, and the length of the text. The following is a sample of other data that you may be interested in obtaining:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src=\"eyechart.jpg\" alt=\"Eye Chart\"> \n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.

We can use the select method to retrieve the images in a document, as shown here:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

Jsoup possesses many additional capabilities, but this example demonstrates the basic web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media contain a wealth of information that can be processed and is used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of OAuth, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.
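
One simple way to stay within such a limit is to space requests evenly. The following is a minimal sketch of that idea. The RequestThrottle class is hypothetical, the limit values are supplied by the caller rather than taken from any particular site, and the time is passed in explicitly so that the logic does not depend on the system clock:

```java
// Hypothetical helper: spaces requests so that at most maxRequests
// are sent in any window of windowMillis milliseconds.
final class RequestThrottle {
    private final long minIntervalMillis;
    private long nextAllowedMillis = 0;

    RequestThrottle(int maxRequests, long windowMillis) {
        if (maxRequests <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("Limits must be positive");
        }
        // Round up so that the spacing never exceeds the allowed rate
        this.minIntervalMillis = (windowMillis + maxRequests - 1) / maxRequests;
    }

    // Reserves the next request slot and returns how many milliseconds
    // the caller should wait, given the current time nowMillis
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedMillis);
        nextAllowedMillis = start + minIntervalMillis;
        return start - nowMillis;
    }
}
```

A caller would typically pass System.currentTimeMillis() to reserve and sleep for the returned number of milliseconds before issuing the request.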

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backwards compatible with it. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
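
The polling pattern in this loop can be tried without a Twitter account. The following is a minimal sketch that uses only the JDK; the QueuePollingSketch class, its drain method, and the hand-filled queue are illustrative assumptions standing in for the streaming client. Unlike the loop above, it stops at the first timeout instead of continuing to wait:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class QueuePollingSketch {

    // Takes up to maxMessages from the queue, waiting at most timeoutMillis
    // for each one. Returns early on the first timeout or on interruption.
    static List<String> drain(BlockingQueue<String> queue, int maxMessages,
            long timeoutMillis) {
        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < maxMessages; i++) {
                String msg = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break; // nothing arrived within the timeout
                }
                received.add(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return received;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);
        // Hypothetical stand-in for the streaming client filling the queue
        for (int i = 0; i < 3; i++) {
            queue.offer("{\"id\":" + i + "}");
        }
        for (String msg : drain(queue, 1000, 200)) {
            System.out.println(msg);
        }
    }
}
```

A bounded queue keeps memory use predictable: when the consumer falls behind, offer simply returns false once the capacity is reached.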

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 
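
One common way to keep keys out of source code is to read them from environment variables at startup. The sketch below assumes this approach; the CredentialLookupSketch class, the require helper, and the environment variable names are all hypothetical and are not part of HBC or the Twitter API:

```java
import java.util.Map;

class CredentialLookupSketch {

    // Returns the named setting or fails fast with a message that names
    // the missing setting without revealing any secret value.
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing setting: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        try {
            // Environment variable names are hypothetical
            String consumerKey = require(env, "TWITTER_CONSUMER_KEY");
            String consumerSecret = require(env, "TWITTER_CONSUMER_SECRET");
            String accessToken = require(env, "TWITTER_ACCESS_TOKEN");
            String accessSecret = require(env, "TWITTER_ACCESS_SECRET");
            // The four values would be passed to streamTwitter here
            System.out.println("All four credentials are set");
        } catch (IllegalStateException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```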

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 
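
Because the MediaWiki API works over HTTP, a call such as queryContent ultimately corresponds to a URL with query parameters. The following sketch, which uses only the JDK, builds such a URL by hand. It is not Bliki's implementation; the MediaWikiUrlSketch class is hypothetical, and the parameter choices (action=query, prop=revisions, rvprop=content, format=json) are one plausible way to ask the API for page content:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class MediaWikiUrlSketch {

    // Builds a query URL asking for the current wikitext of the given titles.
    static String contentQueryUrl(String apiUrl, String... titles) {
        // The API accepts several titles separated by '|'
        String joined = String.join("|", titles);
        try {
            return apiUrl + "?action=query&prop=revisions&rvprop=content"
                + "&format=json&titles=" + URLEncoder.encode(joined, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(contentQueryUrl(
            "https://en.wikipedia.org/w/api.php", "Data science"));
    }
}
```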

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
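
Since the bounding box arguments are passed as strings in a fixed order, it is easy to swap a latitude and a longitude by accident. A small helper can check the values before they reach setBBox. The BoundingBoxSketch class and its bbox method are hypothetical, and the coordinates in main are only a rough box around the Isles of Scilly:

```java
class BoundingBoxSketch {

    // Validates the corners and returns them in the order the text describes:
    // minimum longitude, minimum latitude, maximum longitude, maximum latitude.
    static String[] bbox(double minLon, double minLat,
            double maxLon, double maxLat) {
        if (minLon < -180 || maxLon > 180 || minLat < -90 || maxLat > 90) {
            throw new IllegalArgumentException("Coordinate out of range");
        }
        if (minLon > maxLon || minLat > maxLat) {
            throw new IllegalArgumentException("Minimum exceeds maximum");
        }
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)
        };
    }

    public static void main(String[] args) {
        // Rough, illustrative box around the Isles of Scilly
        String[] box = bbox(-6.5, 49.8, -6.2, 50.0);
        System.out.println(String.join(",", box));
    }
}
```

The four elements can then be passed on as searchParameters.setBBox(box[0], box[1], box[2], box[3]).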

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then use its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
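
ImageIO.write returns a boolean that is easy to overlook: it is false when no installed writer accepts the requested format, in which case the image has not been saved. The following sketch checks that result. The ImageSaveSketch class is hypothetical, and a blank placeholder image stands in for one downloaded from Flickr:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class ImageSaveSketch {

    // Writes the image in the given format and reports failure explicitly.
    static void save(BufferedImage image, String format, File target) {
        try {
            if (!ImageIO.write(image, format, target)) {
                throw new IllegalArgumentException(
                    "No writer accepted format: " + format);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // Placeholder image standing in for one downloaded from Flickr
        BufferedImage image =
            new BufferedImage(64, 48, BufferedImage.TYPE_INT_RGB);
        save(image, "jpg", new File("image.jpg"));
    }
}
```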

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name, and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 
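
It can help to see how these setter calls relate to the underlying REST request. The sketch below, which uses only the JDK, assembles a search URL by hand with the same part, query, type, and result limit. It is an illustration of the idea, not what the client library does internally; the YouTubeSearchUrlSketch class is hypothetical and KEY is a placeholder:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class YouTubeSearchUrlSketch {

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Builds the request URL with the parameters in a fixed order.
    static String searchUrl(String apiKey, String query, long maxResults) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("part", "id,snippet");
        params.put("q", query);
        params.put("type", "video");
        params.put("maxResults", Long.toString(maxResults));
        params.put("key", apiKey);

        StringBuilder url = new StringBuilder(
            "https://www.googleapis.com/youtube/v3/search");
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator).append(param.getKey())
               .append('=').append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("KEY", "cats", 10));
    }
}
```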

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Creating your own web crawler

Now that we have a basic understanding of web crawlers, we are ready to create our own. In this simple web crawler, we will keep track of the pages visited using ArrayList instances. In addition, jsoup will be used to parse a web page and we will limit the number of pages we visit. Jsoup (https://jsoup.org/) is an open source HTML parser. This example demonstrates the basic structure of a web crawler and also highlights some of the issues involved in creating a web crawler.

We will use the SimpleWebCrawler class, as declared here:

public class SimpleWebCrawler { 
 
    private String topic; 
    private String startingURL; 
    private String urlLimiter; 
    private final int pageLimit = 20; 
    private ArrayList<String> visitedList = new ArrayList<>(); 
    private ArrayList<String> pageList = new ArrayList<>(); 
    ... 
    public static void main(String[] args) { 
        new SimpleWebCrawler(); 
    } 
 
} 

The instance variables are detailed here:

  • topic: The keyword that needs to be in a page for the page to be accepted
  • startingURL: The URL of the first page
  • urlLimiter: A string that must be contained in a link before it will be followed
  • pageLimit: The maximum number of pages to retrieve
  • visitedList: The ArrayList containing pages that have already been visited
  • pageList: An ArrayList containing the URLs of the pages of interest

In the SimpleWebCrawler constructor, we initialize the instance variables to begin the search from the Wikipedia page for Bishop Rock, a small rock in the Isles of Scilly off the southwestern coast of England. This was chosen to minimize the number of pages that might be retrieved. As we will see, there are many more Wikipedia pages dealing with Bishop Rock than one might think.

The urlLimiter variable is set to Bishop_Rock, which will restrict the embedded links to follow to just those containing that string. Each page of interest must contain the value stored in the topic variable. The visitPage method performs the actual crawl:

    public SimpleWebCrawler() { 
        startingURL = "https://en.wikipedia.org/wiki/Bishop_Rock,_" 
            + "Isles_of_Scilly"; 
        urlLimiter = "Bishop_Rock"; 
        topic = "shipping route"; 
        visitPage(startingURL); 
    } 

In the visitPage method, the pageList ArrayList is checked to see whether the maximum number of accepted pages has been reached. If the limit has been reached, then the search terminates:

    public void visitPage(String url) { 
        if (pageList.size() >= pageLimit) { 
            return; 
        } 
       ... 
    } 

If the page has already been visited, then we ignore it. Otherwise, it is added to the visited list:

    if (visitedList.contains(url)) { 
        // URL already visited 
    } else { 
        visitedList.add(url); 
            ... 
    } 

Jsoup is used to parse the page and return a Document object. There are many different exceptions and problems that can occur, such as a malformed URL, retrieval timeouts, or simply bad links. The catch block needs to handle these types of problems. We will provide a more in-depth explanation of jsoup in the Web scraping in Java section:

    try { 
        Document doc = Jsoup.connect(url).get(); 
        ... 
    } catch (Exception ex) { 
        // Handle exceptions 
    } 

If the document contains the topic text, then the link is displayed and added to the pageList ArrayList. Each embedded link is obtained, and if the link contains the limiting text, then the visitPage method is called recursively:

    if (doc.text().contains(topic)) { 
        out.println((pageList.size() + 1) + ": [" + url + "]"); 
        pageList.add(url); 
 
        // Process page links 
        Elements questions = doc.select("a[href]"); 
        for (Element link : questions) { 
            if (link.attr("href").contains(urlLimiter)) { 
                visitPage(link.attr("abs:href")); 
            } 
        } 
    } 
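
The crawl logic can be exercised without network access by replacing page fetching with in-memory maps. The following sketch applies the same acceptance rules (topic text, link limiter, page limit) but makes two deliberate changes: it uses a HashSet for the visited check and a queue instead of recursion, so pages are processed breadth-first. The InMemoryCrawlSketch class and the sample page names are hypothetical:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class InMemoryCrawlSketch {

    // links maps a URL to the URLs it references; text maps a URL to its body.
    static List<String> crawl(Map<String, List<String>> links,
            Map<String, String> text, String start, String topic,
            String urlLimiter, int pageLimit) {
        List<String> accepted = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(start);
        while (!frontier.isEmpty() && accepted.size() < pageLimit) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // URL already visited
            }
            if (!text.getOrDefault(url, "").contains(topic)) {
                continue; // page is not of interest, so skip its links too
            }
            accepted.add(url);
            List<String> outgoing =
                links.getOrDefault(url, Collections.<String>emptyList());
            for (String link : outgoing) {
                if (link.contains(urlLimiter)) {
                    frontier.add(link);
                }
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("Bishop_Rock",
            Arrays.asList("Bishop_Rock_Lighthouse", "Cornwall"));
        Map<String, String> text = new HashMap<>();
        text.put("Bishop_Rock", "near a shipping route");
        text.put("Bishop_Rock_Lighthouse", "guards the shipping route");
        System.out.println(crawl(links, text, "Bishop_Rock",
            "shipping route", "Bishop_Rock", 20));
    }
}
```

Avoiding recursion keeps the call stack shallow on sites with long link chains, and HashSet.add reports whether the URL was new in a single call.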

This approach only examines links in those pages that contain the topic text. Moving the for loop outside of the if statement will test the links for all pages.

The output follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl in an external source. Normally this is necessary, and the results can be stored in a file or database.
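
As a minimal sketch of the file option, the accepted URLs can be written one per line with the java.nio.file API and read back later. The CrawlResultStoreSketch class and the crawl-results.txt file name are hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class CrawlResultStoreSketch {

    // Writes one URL per line, replacing any existing file.
    static void save(List<String> urls, Path file) {
        try {
            Files.write(file, urls, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Reads the URLs back in the order they were saved.
    static List<String> load(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get("crawl-results.txt"); // hypothetical file name
        save(Arrays.asList(
            "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly",
            "https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse"), file);
        System.out.println(load(file).size() + " URLs saved");
    }
}
```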

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler while the latter contains the logic that controls what pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The set of results from this crawler will be smaller, as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They provide direction to help a crawler do a better job, such as specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 
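
To make the role of robots.txt more concrete, the following sketch reads the Disallow rules that apply to all crawlers and tests a path against them. It is a heavily simplified illustration that uses only the JDK, not how crawler4j implements it: it handles only groups addressed to "*", ignores Allow rules and wildcards, and uses plain prefix matching. The RobotsTxtSketch class and the sample rules are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsTxtSketch {

    // Collects the Disallow path prefixes from groups addressed to "*".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;      // does the current group target "*"?
        boolean lastWasAgent = false; // consecutive User-agent lines share a group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field =
                line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!lastWasAgent) {
                    applies = false; // a new group begins
                }
                applies = applies || value.equals("*");
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    // A path is allowed when no Disallow prefix matches it.
    static boolean isAllowed(String path, List<String> rules) {
        for (String prefix : rules) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /w/\n"; // sample rules
        List<String> rules = disallowedForAll(robots);
        System.out.println(isAllowed("/wiki/Bishop_Rock", rules));
        System.out.println(isAllowed("/w/index.php", rules));
    }
}
```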

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest. The first is the shouldVisit method, which determines whether a page will be visited, and the second is the visit method, which actually handles the page. We start with the class declaration and the declaration of a Pattern object from Java's regular expression package. It will be one way of determining whether a page will be visited. In this declaration, standard image file extensions are specified, and URLs ending with them will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found along with the URL. If the URL ends with one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }
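
The filtering rule itself does not depend on crawler4j, so it can be checked in isolation. The sketch below applies the same two tests to plain strings; the UrlFilterSketch class is hypothetical and takes a String rather than the Page and WebURL arguments used above:

```java
import java.util.Locale;
import java.util.regex.Pattern;

class UrlFilterSketch {

    private static final Pattern IMAGE_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpg|png)$");

    // Rejects image URLs, then accepts only English Wikipedia article paths.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false;
        }
        return href.startsWith("https://en.wikipedia.org/wiki/");
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly",
            "https://en.wikipedia.org/wiki/File:Example.JPG",
            "https://en.wikipedia.org/w/index.php?title=Bishop_Rock&oldid=1"
        };
        for (String sample : samples) {
            System.out.println(shouldVisit(sample) + " " + sample);
        }
    }
}
```

Note that revision and diff pages live under /w/index.php rather than /wiki/, so the prefix test excludes them.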

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, Text, and Text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL, the text, and the length of the text. The following is a sample of other data that you may be interested in obtaining:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    // Inner double quotes must be escaped inside a Java string literal
    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src=\"eyechart.jpg\" alt=\"Eye Chart\"> \n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter type of the select method is a string. By using a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.

We can use the select method to retrieve images from a document. Here, the selector matches only img elements whose src attribute ends in .png:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the web scraping process. There are also other Java HTML parsers available. A comparison of Java HTML parsers, among others, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media sites contain a wealth of information that can be processed and used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of the OAuth standard, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.
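
One simple way to stay under a request limit is to space requests out on the client side. The following is a minimal sketch of that idea using only the JDK. The RequestThrottle class and the 500 millisecond interval are our own illustrative assumptions and are not part of any of the APIs discussed in this chapter; the actual limits are set by each site:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: keeps successive requests at least a fixed interval apart. */
final class RequestThrottle {
    private final long minIntervalNanos;
    private long nextAllowedNanos;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nextAllowedNanos = System.nanoTime(); // the first call may proceed at once
    }

    /** Blocks until the next request is permitted, then reserves the following slot. */
    synchronized void acquire() {
        long waitNanos = nextAllowedNanos - System.nanoTime();
        if (waitNanos > 0) {
            try {
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
        }
        nextAllowedNanos = System.nanoTime() + minIntervalNanos;
    }

    public static void main(String[] args) {
        // Assumed limit for illustration only: at most two requests per second
        RequestThrottle throttle = new RequestThrottle(500);
        for (int i = 1; i <= 3; i++) {
            throttle.acquire();
            System.out.println("Sending request " + i);
        }
    }
}
```

Calling acquire immediately before each API call is enough for a single-process client. A site that reports its remaining quota in response headers would allow a more precise approach.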

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backward compatible with it. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

Twitter is a popular social media platform allowing users to read and post short messages called tweets. The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using the OAuth1 class, which handles OAuth 1.0a authentication. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 
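
The OAuth1 object signs each request for us, so we never compute a signature ourselves. To give a sense of what such signing involves, the following is a simplified sketch of the HMAC-SHA1 signature described by the OAuth 1.0 specification (RFC 5849), written with JDK classes only. The OAuth1SignatureSketch class and its method names are our own; it is not HBC's code, and it assumes each parameter name appears only once:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only: a simplified OAuth 1.0 HMAC-SHA1 signature. */
final class OAuth1SignatureSketch {

    /** Percent-encodes a value the way OAuth 1.0 expects (RFC 3986 unreserved set). */
    static String percentEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8")
                .replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Computes the raw HMAC-SHA1 of the data under the given key. */
    static byte[] hmacSha1(byte[] key, byte[] data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key, "HmacSHA1"));
            return mac.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds the signature base string and signs it with both secrets. */
    static String sign(String method, String baseUrl, Map<String, String> params,
                       String consumerSecret, String tokenSecret) {
        // Encode every name and value, then sort by encoded name
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            sorted.put(percentEncode(e.getKey()), percentEncode(e.getValue()));
        }
        StringBuilder joined = new StringBuilder();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (joined.length() > 0) {
                joined.append('&');
            }
            joined.append(e.getKey()).append('=').append(e.getValue());
        }
        String baseString = method.toUpperCase() + "&" + percentEncode(baseUrl)
            + "&" + percentEncode(joined.toString());
        String signingKey = percentEncode(consumerSecret) + "&" + percentEncode(tokenSecret);
        byte[] digest = hmacSha1(signingKey.getBytes(StandardCharsets.UTF_8),
            baseString.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest);
    }
}
```

The resulting string is sent as the oauth_signature parameter. Because the signing key combines the consumer secret and the access token secret, neither secret ever travels with the request.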

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
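
The timed poll call is what keeps the loop from blocking forever when the stream goes quiet. The following self-contained sketch isolates that behavior with a plain LinkedBlockingQueue and no Twitter connection. The QueuePollSketch class and its drain method are hypothetical names of our own. Unlike the loop above, this version stops at the first timeout instead of continuing to wait:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative only: reading from a BlockingQueue with a timeout. */
final class QueuePollSketch {

    /** Reads up to maxMessages, stopping early if nothing arrives within the timeout. */
    static List<String> drain(BlockingQueue<String> queue, int maxMessages,
                              long timeoutMillis) {
        List<String> received = new ArrayList<>();
        try {
            for (int i = 0; i < maxMessages; i++) {
                // poll returns null if the timeout elapses with no message
                String msg = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break;
                }
                received.add(msg);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return received;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10);
        // A stand-in producer thread plays the role of the streaming client
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= 3; i++) {
                queue.offer("message " + i);
            }
        });
        producer.start();
        System.out.println(drain(queue, 5, 200));
    }
}
```

A bounded queue also provides back pressure: if the consumer falls behind, the queue fills up and the producer can no longer add messages, so memory use stays capped.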

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.
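
Because the MediaWiki API works over HTTP, a request is ultimately just a URL with query parameters. The following sketch builds such a URL for retrieving page content. The MediaWikiQueryUrl class is our own illustration. The parameters shown (action, prop, rvprop, format, and titles) are standard MediaWiki API parameters, but the particular combination is an assumption about a typical content query and is not necessarily what Bliki sends:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Illustrative only: composing a MediaWiki API query URL by hand. */
final class MediaWikiQueryUrl {

    /** Builds a URL asking for the current wikitext of the given page titles. */
    static String contentQueryUrl(String apiEndpoint, String... titles) {
        // The MediaWiki API separates multiple values with the pipe character
        String joined = String.join("|", titles);
        return apiEndpoint
            + "?action=query"
            + "&prop=revisions"
            + "&rvprop=content"
            + "&format=json"
            + "&titles=" + encode(joined);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(contentQueryUrl(
            "https://en.wikipedia.org/w/api.php", "Data science"));
    }
}
```

A client library such as Bliki hides this detail, but seeing the URL makes it easier to test a query in a browser before writing code.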

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
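
Since the bounding box is passed as four strings, it is easy to supply the values in the wrong order. The following sketch shows one way to guard against that with a small value class. BoundingBox is a hypothetical helper of our own, not part of Flickr4Java, and it assumes the box does not cross the 180 degree meridian:

```java
/** Hypothetical helper: a validated longitude/latitude bounding box. */
final class BoundingBox {
    final double minLon;
    final double minLat;
    final double maxLon;
    final double maxLat;

    BoundingBox(double minLon, double minLat, double maxLon, double maxLat) {
        if (minLon < -180 || maxLon > 180 || minLat < -90 || maxLat > 90) {
            throw new IllegalArgumentException("Coordinate out of range");
        }
        if (minLon > maxLon || minLat > maxLat) {
            throw new IllegalArgumentException("Minimum exceeds maximum");
        }
        this.minLon = minLon;
        this.minLat = minLat;
        this.maxLon = maxLon;
        this.maxLat = maxLat;
    }

    /** Returns the values in the order listed above: min lon, min lat, max lon, max lat. */
    String[] toArgs() {
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)
        };
    }

    /** Tests whether a point lies inside the box (edges included). */
    boolean contains(double lon, double lat) {
        return lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat;
    }

    public static void main(String[] args) {
        BoundingBox world = new BoundingBox(-180, -90, 180, 90);
        System.out.println(String.join(", ", world.toArgs()));
    }
}
```

The four elements returned by toArgs can then be passed, in order, to setBBox.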

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter. The second parameter determines how many photos are retrieved per page and the third parameter is the offset. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
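
The ImageIO.write method returns a boolean that is false when no appropriate writer is found, so it is worth checking. This can happen with the JPEG format when the image carries an alpha channel, depending on the JDK version. The following sketch, which uses JDK classes only, converts the image to plain RGB first. ImageSaveSketch and saveAsJpeg are hypothetical names of our own, and filling transparent areas with white is an arbitrary choice:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Illustrative only: saving a BufferedImage as a JPEG and checking the result. */
final class ImageSaveSketch {

    /** Writes the image as a JPEG; returns false if no writer accepted it. */
    static boolean saveAsJpeg(BufferedImage image, File target) {
        BufferedImage rgb = image;
        if (image.getType() != BufferedImage.TYPE_INT_RGB) {
            // Redraw onto an RGB canvas so that no alpha channel remains
            rgb = new BufferedImage(image.getWidth(), image.getHeight(),
                BufferedImage.TYPE_INT_RGB);
            Graphics2D g = rgb.createGraphics();
            g.drawImage(image, 0, 0, Color.WHITE, null); // white shows through transparency
            g.dispose();
        }
        try {
            return ImageIO.write(rgb, "jpg", target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage sample = new BufferedImage(64, 48, BufferedImage.TYPE_INT_ARGB);
        File out = new File(System.getProperty("java.io.tmpdir"), "sample.jpg");
        System.out.println("Saved: " + saveAsJpeg(sample, out));
    }
}
```

The same helper can be applied to the BufferedImage returned by either version of the getImage method.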

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The class, YouTube.Search.List, maintains a collection of search results. The YouTube class's search method specifies the type of resource to be returned. In this case, the string specifies the id and snippet portions of the search result to be returned:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of media to be returned. In this case, only videos will be returned. The other two options include channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Using the crawler4j web crawler

Here we will illustrate the use of the crawler4j (https://github.com/yasserg/crawler4j) web crawler. We will use an adapted version of the basic crawler found at https://github.com/yasserg/crawler4j/tree/master/src/test/java/edu/uci/ics/crawler4j/examples/basic. We will create two classes: CrawlerController and SampleCrawler. The former class sets up the crawler while the latter contains the logic that controls what pages will be processed.

As with our previous crawler, we will crawl the Wikipedia article dealing with Bishop Rock. The results using this crawler will be smaller as many extraneous pages are ignored.

Let's look at the CrawlerController class first. There are several parameters that are used with the crawler as detailed here:

  • Crawl storage folder: The location where crawl data is stored
  • Number of crawlers: This controls the number of threads used for the crawl
  • Politeness delay: How many milliseconds to pause between requests
  • Crawl depth: How deep the crawl will go
  • Maximum number of pages to fetch: How many pages to fetch
  • Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController { 
 
  public static void main(String[] args) throws Exception { 
    int numberOfCrawlers = 2; 
    CrawlConfig config = new CrawlConfig(); 
    String crawlStorageFolder = "data"; 
     
    config.setCrawlStorageFolder(crawlStorageFolder); 
    config.setPolitenessDelay(500); 
    config.setMaxDepthOfCrawling(2); 
    config.setMaxPagesToFetch(20); 
    config.setIncludeBinaryContentInCrawling(false); 
    ... 
  } 
}
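
To make the effect of setMaxDepthOfCrawling and setMaxPagesToFetch concrete, the following sketch applies both limits to a breadth-first traversal of an in-memory link graph. It is our own simplified model and not crawler4j's implementation. We assume that the seed page is at depth 0 and that the page limit counts pages actually visited:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: how depth and page limits bound a breadth-first crawl. */
final class CrawlLimitsSketch {

    /** Returns the pages visited, in order, when starting from the seed. */
    static List<String> crawl(Map<String, List<String>> links, String seed,
                              int maxDepth, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();

        frontier.add(seed);
        seen.add(seed);
        depth.put(seed, 0);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String page = frontier.poll();
            visited.add(page);
            int d = depth.get(page);
            if (d >= maxDepth) {
                continue; // visit the page but do not follow its links
            }
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(next)) { // true only the first time a link is seen
                    depth.put(next, d + 1);
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

With a depth limit of 2, pages three links away from the seed are never reached, however many pages the fetch limit would still allow.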

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions that are intended to be read by a web crawler. They help a crawler do a better job, for example by specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer =  
        new RobotstxtServer(robotstxtConfig, pageFetcher); 
    CrawlController controller =  
        new CrawlController(config, pageFetcher, robotstxtServer); 
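
The RobotstxtServer class handles these files for us. To illustrate the kind of instructions a robots.txt file carries, the following sketch extracts the Disallow rules that apply to all crawlers and checks a path against them. It is a deliberately simplified model of our own, not crawler4j's parser: it ignores Allow rules, wildcards, and agent-specific groups:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: a minimal reading of robots.txt Disallow rules. */
final class RobotsTxtSketch {

    /** Collects Disallow path prefixes from groups addressed to every agent ("*"). */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToAll = false;
        boolean lastWasAgent = false;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            String line = rawLine;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // strip trailing comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines belong to the same group
                boolean isStar = value.equals("*");
                appliesToAll = lastWasAgent ? (appliesToAll || isStar) : isStar;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A production crawler should rely on a complete implementation, since real files often combine Allow and Disallow rules with pattern matching.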

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

    controller.addSeed( 
      "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly"); 

The start method will begin the crawling process:

    controller.start(SampleCrawler.class, numberOfCrawlers); 

The SampleCrawler class contains two methods of interest. The first is the shouldVisit method, which determines whether a page will be visited; the second is the visit method, which actually handles the page. We start with the class declaration and the declaration of a Pattern object from Java's regular expression package. It will be one way of determining whether a page will be visited. In this declaration, common image file extensions are specified, and URLs ending in them will be ignored:

    public class SampleCrawler extends WebCrawler { 
        private static final Pattern IMAGE_EXTENSIONS =  
            Pattern.compile(".*\\.(bmp|gif|jpg|png)$"); 
 
        ... 
    } 

The shouldVisit method is passed a reference to the page where this URL was found along with the URL. If the URL ends with one of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

    public boolean shouldVisit(Page referringPage, WebURL url) { 
        String href = url.getURL().toLowerCase(); 
        if (IMAGE_EXTENSIONS.matcher(href).matches()) { 
            return false; 
        } 
        return href.startsWith("https://en.wikipedia.org/wiki/"); 
    }
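
The filtering logic can be tried out without running a crawl. The following sketch reproduces the same two checks on plain strings, using only the JDK. The UrlFilterSketch class is a stand-in of our own; in the crawler itself the URL comes from the WebURL argument:

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Illustrative only: the shouldVisit checks applied to plain URL strings. */
final class UrlFilterSketch {
    private static final Pattern IMAGE_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpg|png)$");
    private static final String REQUIRED_PREFIX = "https://en.wikipedia.org/wiki/";

    /** Accepts only non-image URLs under the English Wikipedia article path. */
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT); // makes the extension check case-insensitive
        if (IMAGE_EXTENSIONS.matcher(href).matches()) {
            return false;
        }
        return href.startsWith(REQUIRED_PREFIX);
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly",
            "https://en.wikipedia.org/wiki/File:Bishop_Rock.JPG",
            "https://www.mediawiki.org/wiki/API"
        };
        for (String sample : samples) {
            System.out.println(shouldVisit(sample) + "  " + sample);
        }
    }
}
```

Lowercasing the URL first is what lets an upper case extension such as .JPG match the lower case pattern.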

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, Text, and Text length are displayed:

    public void visit(Page page) { 
        String url = page.getWebURL().getURL(); 
 
        if (page.getParseData() instanceof HtmlParseData) { 
            HtmlParseData htmlParseData =  
                (HtmlParseData) page.getParseData(); 
            String text = htmlParseData.getText(); 
            if (text.contains("shipping route")) { 
                out.println("\nURL: " + url); 
                out.println("Text: " + text); 
                out.println("Text length: " + text.length()); 
            } 
        } 
    } 

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly
Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)
...
Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page.
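
How a crawler avoids fetching the same page repeatedly can be illustrated with a small stand-alone sketch. It is not crawler4j's implementation. It assumes a hypothetical FrontierSketch class and replaces fetching and parsing with an in-memory map from each page to the pages it links to:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a seed, a frontier queue, and a "seen" set.
class FrontierSketch {
    // Visits pages breadth-first from the seed, each page at most once,
    // stopping after maxPages pages. linkGraph stands in for real fetching.
    static List<String> crawl(String seed,
            Map<String, List<String>> linkGraph, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        List<String> visited = new ArrayList<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String page = frontier.poll();
            visited.add(page);
            for (String link : linkGraph.getOrDefault(page, List.of())) {
                // Set.add returns false for links that were already queued
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("A", "C"),
            "C", List.of("A"));
        System.out.println(crawl("A", graph, 10));
    }
}
```

Although pages B and C both link back to A, each page appears only once in the result because links that are already in the seen set are never queued again.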

We could perform further processing, but this example provides some insight into how the API works. Significant amounts of information can be obtained when visiting a page. In the example, we only used the URL, the text, and the length of the text. The following is a sample of other data that you may be interested in obtaining:

  • URL path
  • Parent URL
  • Anchor
  • HTML text
  • Outgoing links
  • Document ID

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
     } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src=\"eyechart.jpg\" alt=\"Eye Chart\"> \n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method takes a string specifying the elements of the document to retrieve and returns them as an Elements collection, whose text method returns their combined text:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The select method takes a string containing a CSS-like selector. Because the selection is expressed as a string, the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.

We can use the select method to retrieve images in a document. Here, we select the img elements whose src attribute ends with .png:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities, but this example demonstrates the basic web scraping process. Other Java HTML parsers are also available. A comparison of HTML parsers, including those for Java, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media sites contain a wealth of information that can be processed and used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of the OAuth standard, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites limit the number of requests that can be made. Keep this in mind when pulling data from a site. If you need to exceed these limits, most sites provide a way to request higher ones.
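
One way to stay within a request limit is to count requests on the client side before sending them. The following is a minimal sketch, assuming a fixed-window limit of a given number of requests per time window. The RequestBudget class, the limits shown, and the injected clock are illustrative assumptions; the actual limits and how they are counted are defined by each site:

```java
import java.util.function.LongSupplier;

// Hypothetical client-side counter for a fixed-window request limit.
class RequestBudget {
    private final int maxRequests;
    private final long windowMillis;
    private final LongSupplier clock; // supplies the current time in ms
    private long windowStart;
    private int used;

    RequestBudget(int maxRequests, long windowMillis, LongSupplier clock) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns true and records the request if the budget allows it now.
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            // A new window has begun, so the count starts over
            windowStart = now;
            used = 0;
        }
        if (used < maxRequests) {
            used++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed limit for illustration: 2 requests per second
        RequestBudget budget =
            new RequestBudget(2, 1000, System::currentTimeMillis);
        for (int i = 0; i < 3; i++) {
            System.out.println("Request " + i + " allowed: "
                + budget.tryAcquire());
        }
    }
}
```

A caller would check tryAcquire before each request and wait or skip the request when it returns false. Passing the clock in as a parameter keeps the class easy to test.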

Using OAuth to authenticate users

OAuth is an open standard for delegated authorization that many websites also use to authenticate users. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backwards compatible with it. It provides client developers with a simple way of providing authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.
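
To make the delegation step more concrete, the following sketch builds the URL to which a client sends the resource owner at the start of the OAuth 2.0 authorization code flow. The parameter names (response_type, client_id, redirect_uri, scope, and state) come from the OAuth 2.0 specification. The endpoint, client ID, and redirect URI are placeholders, and the AuthorizationUrlSketch class is hypothetical:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; real providers publish their own endpoints and scopes.
class AuthorizationUrlSketch {
    static String build(String endpoint, String clientId,
            String redirectUri, String scope, String state) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("response_type", "code"); // ask for an authorization code
        params.put("client_id", clientId);
        params.put("redirect_uri", redirectUri);
        params.put("scope", scope);
        params.put("state", state); // opaque value echoed back to the client

        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "="
                + URLEncoder.encode(entry.getValue(),
                    StandardCharsets.UTF_8));
        }
        return endpoint + "?" + query;
    }

    public static void main(String[] args) {
        // All argument values below are placeholders
        System.out.println(build(
            "https://auth.example.com/authorize",
            "my-client",
            "https://app.example.com/callback",
            "read",
            "xyz123"));
    }
}
```

The user's password never appears in this request. The user signs in at the provider's site, and the client later exchanges the returned code for an access token.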

Handling Twitter

Twitter is a popular social media platform that allows users to read and post short messages called tweets. The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure that it returns incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
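
The timed poll in this loop can be tried without a Twitter connection. The following is a minimal sketch using only JDK classes, assuming a hypothetical QueuePollSketch class in which a producer thread stands in for the stream. Unlike the loop above, which keeps waiting after a timeout, this sketch stops reading at the first timeout:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for reading streamed messages from a queue.
class QueuePollSketch {
    // Reads up to maxMessages, waiting at most timeoutMillis for each one,
    // and stops at the first timeout.
    static List<String> readMessages(BlockingQueue<String> queue,
            int maxMessages, long timeoutMillis) {
        List<String> received = new ArrayList<>();
        try {
            while (received.size() < maxMessages) {
                String msg =
                    queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break; // nothing arrived within the wait
                }
                received.add(msg);
            }
        } catch (InterruptedException ex) {
            // Preserve the interrupt status for the caller
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);
        // The producer plays the role of the streaming client
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= 3; i++) {
                queue.offer("{\"id\":" + i + "}");
            }
        });
        producer.start();
        System.out.println(readMessages(queue, 1000, 500));
    }
}
```

Because poll returns null on a timeout rather than blocking forever, the reader can decide whether to keep waiting, as the Twitter example does, or to give up, as this sketch does.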

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.
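
Since the API is reached over HTTP, a request can be expressed as a plain URL. The following is a minimal sketch, assuming Java 11 or later and a hypothetical MediaWikiQuerySketch class. It builds a query for the current revision content of one or more pages using the action, prop, rvprop, rvslots, format, and titles parameters of the MediaWiki API. The User-Agent string is a placeholder:

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that talks to the MediaWiki API without a client library.
class MediaWikiQuerySketch {
    // Builds a query URL; multiple titles are separated by a pipe character.
    static String queryUrl(String apiUrl, String... titles) {
        String joined = String.join("|", titles);
        return apiUrl
            + "?action=query&prop=revisions&rvprop=content&rvslots=main"
            + "&format=json&titles="
            + URLEncoder.encode(joined, StandardCharsets.UTF_8);
    }

    // Sends the request and returns the raw JSON response body.
    static String fetch(String url)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
            .header("User-Agent", "MediaWikiQuerySketch/0.1 (example)")
            .GET()
            .build();
        return client
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();
    }

    public static void main(String[] args) throws Exception {
        String url = queryUrl(
            "https://en.wikipedia.org/w/api.php", "Data science");
        System.out.println(url);
        // Uncomment to perform the request (requires network access):
        // System.out.println(fetch(url));
    }
}
```

The response is JSON that still has to be parsed, which is one reason to prefer a client library for anything beyond simple reads.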

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

Logging in with an actual account would enable us to modify pages. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first argument of the search method. The second argument determines how many photos are retrieved per page and the third argument is the page offset. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
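
The ImageIO.write method returns false, rather than throwing an exception, when no suitable writer is found for the image and format, so its result is worth checking. The following is a minimal sketch, assuming a hypothetical ImageSaveSketch class and a blank image in place of one downloaded from Flickr:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper that treats a failed write as an error.
class ImageSaveSketch {
    static void save(BufferedImage image, String format, File target) {
        try {
            // write returns false if no appropriate writer is found
            if (!ImageIO.write(image, format, target)) {
                throw new IllegalStateException(
                    "No ImageIO writer accepted format '" + format + "'");
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // A blank RGB image stands in for the downloaded photo
        BufferedImage image =
            new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        File target = new File(
            System.getProperty("java.io.tmpdir"), "image-save-sketch.jpg");
        save(image, "jpg", target);
        System.out.println("Saved " + target.length() + " bytes to "
            + target);
    }
}
```

Wrapping the IOException in an unchecked exception is only a convenience for this sketch. An application may prefer to let the checked exception propagate.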

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                // No request initialization is needed for this example 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a search request. The YouTube class's search method returns an object whose list method creates this request. The string argument of the list method specifies which parts of each search result are to be returned. In this case, it specifies the id and snippet parts:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 
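
Each of these settings corresponds to a query parameter of the underlying HTTP request to the YouTube Data API's search endpoint. The following is a minimal sketch, assuming a hypothetical YouTubeSearchUrlSketch class, that shows how the request could look if it were built by hand. The API key is a placeholder, and the client library remains the more convenient route:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper showing the query parameters behind the setter calls.
class YouTubeSearchUrlSketch {
    private static final String ENDPOINT =
        "https://www.googleapis.com/youtube/v3/search";

    static String searchUrl(String apiKey, String query, long maxResults) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("part", "id,snippet");          // list("id,snippet")
        params.put("q", query);                    // setQ
        params.put("type", "video");               // setType
        params.put("maxResults", Long.toString(maxResults));
        params.put("fields",                       // setFields
            "items(id/kind,id/videoId,snippet/title,"
            + "snippet/description,snippet/thumbnails/default/url)");
        params.put("key", apiKey);                 // setKey
        return ENDPOINT + "?" + params.entrySet().stream()
            .map(e -> e.getKey() + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
            .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // "YOUR_API_KEY" is a placeholder, not a working key
        System.out.println(searchUrl("YOUR_API_KEY", "cats", 10L));
    }
}
```

Seeing the parameters side by side makes it clear that part selects the sections of each result, while fields trims the response further to the individual properties of interest.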

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Web scraping in Java

Web scraping is the process of extracting information from a web page. The page is typically formatted using a series of HTML tags. An HTML parser is used to navigate through a page or series of pages and to access the page's data or metadata.

Jsoup (https://jsoup.org/) is an open source Java library that facilitates extracting and manipulating HTML documents using an HTML parser. It is used for a number of purposes, including web scraping, extracting specific elements from an HTML page, and cleaning up HTML documents.

There are several ways of obtaining an HTML document that may be useful. The HTML document can be extracted from a:

  • URL
  • String
  • File

The first approach is illustrated next where the Wikipedia page for data science is loaded into a Document object. This Jsoup object represents the HTML document. The connect method connects to the site and the get method retrieves the document:

    try { 
        Document document = Jsoup.connect( 
            "https://en.wikipedia.org/wiki/Data_science").get(); 
        ... 
     } catch (IOException ex) { 
        // Handle exception 
    } 

Loading from a file uses the File class as shown next. The overloaded parse method uses the file to create the document object:

    try { 
        File file = new File("Example.html"); 
        Document document = Jsoup.parse(file, "UTF-8", ""); 
        ... 
    } catch (IOException ex) { 
        // Handle exception 
    } 

The Example.html file follows:

<html> 
<head><title>Example Document</title></head> 
<body> 
<p>The body of the document</p> 
Interesting Links: 
<br> 
<a href="https://en.wikipedia.org/wiki/Data_science">Data Science</a> 
<br> 
<a href="https://en.wikipedia.org/wiki/Jsoup">Jsoup</a> 
<br> 
Images: 
<br> 
 <img src="eyechart.jpg" alt="Eye Chart">  
</body> 
</html> 

To create a Document object from a string, we will use the following sequence where the parse method processes the string that duplicates the previous HTML file:

    String html = "<html>\n" 
        + "<head><title>Example Document</title></head>\n" 
        + "<body>\n" 
        + "<p>The body of the document</p>\n" 
        + "Interesting Links:\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Data_science\">" 
        + "Data Science</a>\n" 
        + "<br>\n" 
        + "<a href=\"https://en.wikipedia.org/wiki/Jsoup\">" 
        + "Jsoup</a>\n" 
        + "<br>\n" 
        + "Images:\n" 
        + "<br>\n" 
        + " <img src=\"eyechart.jpg\" alt=\"Eye Chart\"> \n" 
        + "</body>\n" 
        + "</html>"; 
    Document document = Jsoup.parse(html);

The Document class possesses a number of useful methods. The title method returns the title. To get the text contents of the document, the select method is used. This method uses a string specifying the element of a document to retrieve:

    String title = document.title(); 
    out.println("Title: " + title); 
    Elements element = document.select("body"); 
    out.println("  Text: " + element.text()); 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Title: Data science - Wikipedia, the free encyclopedia
Text: Data science From Wikipedia, the free encyclopedia Jump to: navigation, search Not to be confused with information science. Part of a 
...
policy About Wikipedia Disclaimers Contact Wikipedia Developers Cookie statement Mobile view

The parameter of the select method is a string containing a CSS-like selector, so the type of information selected is easily changed. Details on how to formulate this string are found in the jsoup Javadocs for the Selector class at https://jsoup.org/apidocs/.

We can use the select method to retrieve images in a document, as shown here. The selector img[src$=.png] matches only img elements whose src attribute ends with .png:

    Elements images = document.select("img[src$=.png]"); 
    for (Element image : images) { 
        out.println("\nImage: " + image); 
    } 

The output for the Wikipedia data science page is shown here. It has been shortened to conserve space:

Image: <img alt="Data Visualization" src="//upload.wikimedia.org/...>
Image: <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/ba/...>
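
The src values in this output are protocol-relative (they begin with //), and the eyechart.jpg reference in Example.html is relative to the page that contains it. Before such a reference can be downloaded, it has to be resolved against the address of the page. The following is a minimal sketch of that step using only java.net.URI from the JDK. The LinkResolver class and the example.com address are hypothetical and used only for illustration. jsoup also offers its own helper for this, the absUrl method of an element, which depends on the base URI supplied when the document was parsed:

```java
import java.net.URI;

public class LinkResolver {

    // Resolves a possibly relative reference against the address
    // of the page it was found on.
    public static String resolve(String pageUrl, String reference) {
        return URI.create(pageUrl).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // A protocol-relative reference inherits the scheme of the page
        System.out.println(resolve(
            "https://en.wikipedia.org/wiki/Data_science",
            "//upload.wikimedia.org/image.png"));

        // A plain file name is resolved against the page's directory
        System.out.println(resolve(
            "https://example.com/docs/Example.html",
            "eyechart.jpg"));
    }
}
```

This also suggests why the third argument of the Jsoup.parse method used earlier matters: with an empty base URI, relative references in a file cannot be turned into absolute addresses.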

Links can be easily retrieved as shown next:

    Elements links = document.select("a[href]"); 
    for (Element link : links) { 
        out.println("Link: " + link.attr("href") 
            + " Text: " + link.text()); 
    } 

The output for the Example.html page is shown here:

Link: https://en.wikipedia.org/wiki/Data_science Text: Data Science
Link: https://en.wikipedia.org/wiki/Jsoup Text: Jsoup

jsoup possesses many additional capabilities. However, this example demonstrates the basic web scraping process. There are also other Java HTML parsers available. A comparison of HTML parsers, including those for Java, can be found at https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers.

Using API calls to access common social media sites

Social media sites contain a wealth of information that can be processed and that is used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion of OAuth, which provides one approach to authenticating access to a data source.

When working with this type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites limit the number of requests that can be made in a given period. Keep this in mind when pulling data from a site. If you need to exceed these limits, most sites provide a way to arrange this.
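
One way to stay within a request limit is to space calls out on the client side. The following is a minimal sketch, using only the JDK, of a throttle that enforces a minimum interval between requests. The RequestThrottle class is hypothetical, and the 200 millisecond interval is an arbitrary value chosen for illustration. The actual limits differ from site to site and should be taken from each site's documentation:

```java
import java.util.concurrent.TimeUnit;

public class RequestThrottle {

    private final long minIntervalNanos;
    private long nextAllowedNanos;

    public RequestThrottle(long minInterval, TimeUnit unit) {
        this.minIntervalNanos = unit.toNanos(minInterval);
        this.nextAllowedNanos = System.nanoTime();
    }

    // Blocks until the minimum interval since the previous call has passed
    public synchronized void acquire() {
        long now = System.nanoTime();
        long start = Math.max(now, nextAllowedNanos);
        if (start > now) {
            try {
                TimeUnit.NANOSECONDS.sleep(start - now);
            } catch (InterruptedException ex) {
                // Preserve the interrupt status for the caller
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", ex);
            }
        }
        nextAllowedNanos = start + minIntervalNanos;
    }

    public static void main(String[] args) {
        RequestThrottle throttle =
            new RequestThrottle(200, TimeUnit.MILLISECONDS);
        long begin = System.nanoTime();
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            // A request to the site would be issued here
            long elapsed = TimeUnit.NANOSECONDS.toMillis(
                System.nanoTime() - begin);
            System.out.println("Request " + i + " at " + elapsed + " ms");
        }
    }
}
```

Calling acquire before each request keeps the request rate below one per interval, regardless of how quickly the surrounding loop runs.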

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backwards compatible with it. It provides client developers with a simple way of providing authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, the HBC class that implements the Authentication interface for OAuth 1.0. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
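
The polling pattern in this loop does not depend on Twitter and can be tried without an account. The following is a minimal sketch, using only the JDK, in which a producer thread stands in for the streaming client and the consumer reads from the queue using poll with a timeout. The QueueDrainSketch class and the message text are hypothetical. Unlike the loop above, this sketch stops at the first timeout instead of continuing to wait:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueDrainSketch {

    // Reads up to maxMessages from the queue and gives up
    // on the first poll that times out.
    public static List<String> drain(BlockingQueue<String> queue,
            int maxMessages, long timeout, TimeUnit unit) {
        List<String> received = new ArrayList<>();
        try {
            while (received.size() < maxMessages) {
                String msg = queue.poll(timeout, unit);
                if (msg == null) {
                    break; // Nothing arrived within the timeout
                }
                received.add(msg);
            }
        } catch (InterruptedException ex) {
            // Preserve the interrupt status and return what we have
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(10000);

        // Stand-in for the streaming client filling the queue
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= 3; i++) {
                queue.offer("message " + i);
            }
        });
        producer.start();

        List<String> messages =
            drain(queue, 1000, 200, TimeUnit.MILLISECONDS);
        producer.join();
        System.out.println(messages);
    }
}
```

Because poll returns null on a timeout rather than blocking forever, the consumer can detect a stalled stream and decide whether to keep waiting or shut down.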

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 
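
One common way to protect authentication information is to keep it out of the source code entirely. The following is a minimal sketch, using only the JDK, that reads the four values from a properties source and fails early if one is missing. The CredentialLoader class and the property names are hypothetical. In practice, the Reader would typically wrap a file that is kept out of version control:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class CredentialLoader {

    // Hypothetical property names, in the order expected by streamTwitter
    private static final String[] NAMES = {
        "consumerKey", "consumerSecret", "accessToken", "accessSecret"};

    // Returns the four credential values in the order listed in NAMES
    public static String[] load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        String[] values = new String[NAMES.length];
        for (int i = 0; i < NAMES.length; i++) {
            String value = props.getProperty(NAMES[i]);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException(
                    "Missing credential: " + NAMES[i]);
            }
            values[i] = value.trim();
        }
        return values;
    }

    public static void main(String[] args) {
        // Placeholder values; a real application would read a file
        Reader source = new StringReader(
            "consumerKey=placeholder1\nconsumerSecret=placeholder2\n"
            + "accessToken=placeholder3\naccessSecret=placeholder4\n");
        String[] credentials = load(source);
        System.out.println("Loaded " + credentials.length + " values");
    }
}
```

The returned array could then be passed to streamTwitter in place of the myKey, mySecret, myToken, and myAccess variables.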

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.
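
Since the API works over HTTP, a request is ultimately a URL with query parameters. The following is a minimal sketch, using only the JDK, that builds such a URL for retrieving the content of a page. The WikiQueryUrl class is hypothetical, and the parameter set shown (action, prop, rvprop, format, and titles) is an assumption that should be checked against the current MediaWiki API documentation. Libraries such as the one used below assemble comparable requests on our behalf:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class WikiQueryUrl {

    // Builds a query URL asking for the latest revision content of a page
    public static String forTitle(String apiUrl, String title) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("action", "query");
        params.put("prop", "revisions");
        params.put("rvprop", "content");
        params.put("format", "json");
        params.put("titles", title);

        StringBuilder url = new StringBuilder(apiUrl);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            // Spaces are encoded as '+' in the query string
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 is not available", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(forTitle(
            "https://en.wikipedia.org/w/api.php", "Data science"));
    }
}
```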

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
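
Because the bounding box is passed as four strings, it is easy to supply the values in the wrong order. The following is a minimal sketch, using only the JDK, of a helper that checks the ranges and returns the values in the order described above. The BoundingBox class is hypothetical and is not part of Flickr4Java, and the coordinates in the main method are arbitrary:

```java
import java.util.Arrays;

public class BoundingBox {

    // Returns {minLon, minLat, maxLon, maxLat} as strings
    // after checking that the values are in range and ordered.
    public static String[] of(double minLon, double minLat,
            double maxLon, double maxLat) {
        // The negated form also rejects NaN values
        if (!(minLon >= -180 && maxLon <= 180 && minLon <= maxLon)) {
            throw new IllegalArgumentException(
                "Longitude must satisfy -180 <= min <= max <= 180");
        }
        if (!(minLat >= -90 && maxLat <= 90 && minLat <= maxLat)) {
            throw new IllegalArgumentException(
                "Latitude must satisfy -90 <= min <= max <= 90");
        }
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)};
    }

    public static void main(String[] args) {
        // A box smaller than the whole-world search used above
        String[] box = of(-10.0, 35.0, 40.0, 70.0);
        System.out.println(Arrays.toString(box));
    }
}
```

The four elements could then be passed to setBBox as box[0], box[1], box[2], and box[3].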

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter of the search method. The second parameter determines how many photos are retrieved per page and the third parameter is the page of results to retrieve. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
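
The ImageIO.write method returns false, rather than throwing an exception, when no suitable writer is found for the image, so its result is worth checking. The following is a minimal sketch, using only the JDK, that copies the image into an RGB image first (on the assumption that the JPEG writer may reject images with an alpha channel) and reports a failed write. The ImageSaver class is hypothetical, and a blank image stands in for one downloaded from Flickr:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

public class ImageSaver {

    // Saves the image as a JPEG file and fails loudly if nothing is written
    public static void saveAsJpeg(BufferedImage image, File target) {
        // Copy into an image without an alpha channel
        BufferedImage rgb = new BufferedImage(image.getWidth(),
            image.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D graphics = rgb.createGraphics();
        graphics.drawImage(image, 0, 0, null);
        graphics.dispose();

        try {
            boolean written = ImageIO.write(rgb, "jpg", target);
            if (!written) {
                throw new IOException("No JPEG writer available");
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // Stand-in for an image returned by getImage
        BufferedImage image =
            new BufferedImage(16, 16, BufferedImage.TYPE_INT_ARGB);
        File target = new File(
            System.getProperty("java.io.tmpdir"), "image.jpg");
        saveAsJpeg(image, target);
        System.out.println("Saved " + target.length() + " bytes");
    }
}
```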

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example, so an empty implementation is supplied

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a search request. The YouTube class's search method provides access to the search operations, and the list method takes a string specifying which parts of each search result are to be returned. In this case, the string specifies the id and snippet parts:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

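One possible output follows, where parts of the output have been modified. The kind is null because the setFields call requested id/kind rather than the top-level kind field of each result: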

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.
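
As one example of such a check, the call to iterator().next() used earlier throws a NoSuchElementException when a search returns no results. The following is a minimal sketch, using only the JDK, of a helper that returns the first element of a list only when one exists. The FirstResult class is hypothetical, and a list of strings stands in for the list of SearchResult objects:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class FirstResult {

    // Returns the first element, or an empty Optional when the list
    // is null, empty, or starts with a null element.
    public static <T> Optional<T> first(List<T> items) {
        if (items == null || items.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(0));
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("first video", "second video");
        first(results).ifPresent(
            video -> System.out.println("Found: " + video));

        List<String> noResults = Collections.emptyList();
        if (!first(noResults).isPresent()) {
            System.out.println("No videos found");
        }
    }
}
```

With the search results, the same call would take the form first(searchResultList), and the display code would run only when a video is present.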

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Using API calls to access common social media sites

Social media contain a wealth of information that can be processed and is used by many data analysis applications. In this section, we will illustrate how to access a few of these sources using their Java APIs. Most of them require some sort of access key, which is normally easy to obtain. We start with a discussion on the OAuth class, which provides one approach to authenticating access to a data source.

When working with the type of data source, it is important to keep in mind that the data is not always public. While it may be accessible, the owner of the data may be an individual who does not necessarily want the information shared. Most APIs provide a means to determine how the data can be distributed, and these requests should be honored. When private information is used, permission from the author must be obtained.

In addition, these sites have limits on the number of requests that can be made. Keep this in mind when pulling data from a site. If these limits need to be exceeded, then most sites provide a way of doing this.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth and is not backwards compatible. It provides client developers a simple way of providing authentication. Several companies use OAuth 2.0 including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handing Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access secret token. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 
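
HBC's OAuth1 class signs each request for us, so we never see the signature itself. As a rough illustration of what OAuth 1.0a signing with HMAC-SHA1 involves, the following sketch uses only the JDK (Java 10 or later). The class name, method names, base string, and secrets are all hypothetical, and this is not HBC's implementation:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: HBC performs the signing internally.
final class OAuthSigningSketch {

    // Percent-encode a value in the RFC 3986 style that OAuth 1.0a expects.
    static String percentEncode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8)
            .replace("+", "%20")
            .replace("*", "%2A")
            .replace("%7E", "~");
    }

    // Compute the raw HMAC-SHA1 digest of the data with the given key.
    static byte[] hmacSha1(String key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(
                key.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException ex) {
            throw new IllegalStateException("HMAC-SHA1 unavailable", ex);
        }
    }

    // The signing key is the two encoded secrets joined by "&".
    static String sign(String baseString, String consumerSecret,
            String tokenSecret) {
        String key = percentEncode(consumerSecret) + "&"
            + percentEncode(tokenSecret);
        return Base64.getEncoder()
            .encodeToString(hmacSha1(key, baseString));
    }

    public static void main(String[] args) {
        // Hypothetical base string and secrets, for illustration only
        String signature = sign(
            "GET&https%3A%2F%2Fexample.com%2Fstream&a%3D1",
            "consumer secret", "token secret");
        System.out.println(signature);
    }
}
```

A real request also needs a nonce, a timestamp, and a carefully normalized base string, which is why a tested library is usually preferable to hand-written signing code.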

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
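
The polling pattern used here can be tried without a Twitter connection. The following is a minimal sketch using only java.util.concurrent, where a producer thread stands in for the stream. The class and method names are hypothetical, and unlike the loop above, this version stops at the first timeout rather than continuing to wait:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class QueuePollSketch {

    // Read up to maxMessages, giving up at the first timeout.
    static List<String> drain(BlockingQueue<String> queue,
            int maxMessages, long timeoutMillis) {
        List<String> received = new ArrayList<>();
        try {
            while (received.size() < maxMessages) {
                String msg = queue.poll(
                    timeoutMillis, TimeUnit.MILLISECONDS);
                if (msg == null) {
                    break; // nothing arrived within the timeout
                }
                received.add(msg);
            }
        } catch (InterruptedException ex) {
            // Preserve the interrupt status for the caller
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        // The producer thread plays the role of the streaming client
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                queue.offer("{\"id\":" + i + "}");
            }
        });
        producer.start();
        producer.join();
        System.out.println(drain(queue, 10, 200));
    }
}
```

Bounding the queue, as we did with a capacity of 10,000, keeps a slow consumer from exhausting memory when messages arrive faster than they are processed.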

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}
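
Two of the fields in this sample carry time information: created_at and timestamp_ms. As a small sketch of how these values might be converted once they have been extracted from the JSON, the following uses java.time. The class and method names are hypothetical, and the date pattern is inferred from the sample above rather than taken from Twitter's documentation:

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class TweetTimeSketch {

    // Matches values such as "Fri May 20 15:47:21 +0000 2016"
    private static final DateTimeFormatter CREATED_AT =
        DateTimeFormatter.ofPattern(
            "EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    // Convert a created_at value to an Instant.
    static Instant parseCreatedAt(String createdAt) {
        return ZonedDateTime.parse(createdAt, CREATED_AT).toInstant();
    }

    // Convert a timestamp_ms value (milliseconds since the epoch).
    static Instant parseTimestampMs(String timestampMs) {
        return Instant.ofEpochMilli(Long.parseLong(timestampMs));
    }

    public static void main(String[] args) {
        System.out.println(
            parseCreatedAt("Fri May 20 15:47:21 +0000 2016"));
        System.out.println(parseTimestampMs("1463759241660"));
    }
}
```

Extracting the field values themselves is best left to a JSON library rather than string manipulation.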

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and images. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 
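
Bliki hides the HTTP details, but it can help to see roughly what a MediaWiki content query looks like. The following sketch builds such a URL using only the JDK (Java 10 or later). The class and method names are hypothetical, and we are not claiming that this is the exact request Bliki sends:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class MediaWikiQuerySketch {

    // Build a query URL asking for the current content of each title.
    static URI buildContentQuery(String apiUrl, String... titles) {
        // The API accepts several titles separated by "|"
        String joined = String.join("|", titles);
        String query = "action=query"
            + "&prop=revisions"
            + "&rvprop=content"
            + "&format=json"
            + "&titles="
            + URLEncoder.encode(joined, StandardCharsets.UTF_8);
        return URI.create(apiUrl + "?" + query);
    }

    public static void main(String[] args) {
        System.out.println(buildContentQuery(
            "https://en.wikipedia.org/w/api.php", "Data science"));
    }
}
```

The resulting URL can be pasted into a browser to inspect the JSON response before writing any parsing code.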

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 
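
To get a feel for what the section headers look like in the raw wiki text, the following sketch extracts them with a regular expression. This is a simplified stand-in and not how Bliki works internally. The class and method names are hypothetical, and the pattern only handles the plain "== Title ==" form with two to six equals signs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiHeadingSketch {

    // A heading line looks like "== Title ==" with matching markers.
    private static final Pattern HEADING = Pattern.compile(
        "^(={2,6})\\s*(.+?)\\s*\\1\\s*$", Pattern.MULTILINE);

    // Return the heading titles in the order they appear.
    static List<String> headings(String wikiText) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HEADING.matcher(wikiText);
        while (matcher.find()) {
            result.add(matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up wiki text for illustration
        String wikiText = "Intro text\n"
            + "== History ==\n"
            + "Some body text\n"
            + "=== Early work ===\n"
            + "More text\n";
        System.out.println(headings(wikiText));
    }
}
```

For real pages, the parser-based approach shown above is more reliable, since wiki markup has many cases that a single pattern does not cover.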

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
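
Since setBBox takes its coordinates as strings, a mistake in their order or range is easy to make. The following sketch shows one way to validate the values before passing them on. The class and method names are hypothetical, and the rule that each minimum must not exceed its maximum is our assumption:

```java
final class BoundingBoxSketch {

    // Return the four values as strings, in the order setBBox expects:
    // minimum longitude, minimum latitude, maximum longitude,
    // maximum latitude.
    static String[] boundingBox(double minLon, double minLat,
            double maxLon, double maxLat) {
        if (minLon < -180 || maxLon > 180 || minLon > maxLon) {
            throw new IllegalArgumentException(
                "Invalid longitude range: " + minLon + " to " + maxLon);
        }
        if (minLat < -90 || maxLat > 90 || minLat > maxLat) {
            throw new IllegalArgumentException(
                "Invalid latitude range: " + minLat + " to " + maxLat);
        }
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)
        };
    }

    public static void main(String[] args) {
        // Illustrative coordinates only
        String[] box = boundingBox(-10.0, 35.0, 30.0, 60.0);
        System.out.println(String.join(",", box));
    }
}
```

The result could then be supplied as searchParameters.setBBox(box[0], box[1], box[2], box[3]).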

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter of the search method. The second parameter determines how many photos are retrieved per page and the third parameter is the page of results to return. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
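
One detail worth knowing is that ImageIO.write returns false, rather than throwing an exception, when no writer is available for the requested format. The following sketch checks that return value, using an image created in memory so that it runs without a Flickr connection. The class and method names are hypothetical:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

final class SaveImageSketch {

    // Write the image and fail loudly if no suitable writer exists.
    static File save(BufferedImage image, String format, File target) {
        try {
            if (!ImageIO.write(image, format, target)) {
                throw new IOException(
                    "No writer found for format: " + format);
            }
            return target;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws Exception {
        // A blank in-memory image stands in for a downloaded photo
        BufferedImage image =
            new BufferedImage(64, 48, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("sketch", ".jpg");
        save(image, "jpg", target);
        System.out.println("Saved " + target.length()
            + " bytes to " + target);
    }
}
```

Wrapping the IOException in an unchecked exception is a choice made to keep the sketch short. In an application, it may be better to let the checked exception propagate.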

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search process. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: The object used for HTTP requests
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: Initializes each request; no initialization is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                    // No initialization is needed for this example 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a search request. The YouTube class's search method provides access to the search operations, and its list method creates the request. The string argument specifies which parts of each search result are to be returned, in this case the id and snippet parts:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 
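
The settings made so far correspond to query parameters of the underlying HTTP request. As a rough sketch of that correspondence, the following builds an equivalent search URL using only the JDK (Java 10 or later). The class and method names are hypothetical, the key is a placeholder, and we are not claiming that the client library produces exactly this string:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class YouTubeSearchUrlSketch {

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Build a search URL for videos matching the query term.
    static String buildSearchUrl(String apiKey, String query,
            long maxResults) {
        return "https://www.googleapis.com/youtube/v3/search"
            + "?part=" + enc("id,snippet")
            + "&type=video"
            + "&maxResults=" + maxResults
            + "&q=" + enc(query)
            + "&key=" + enc(apiKey);
    }

    public static void main(String[] args) {
        // "Your API key" is a placeholder, as in the listing above
        System.out.println(
            buildSearchUrl("Your API key", "cats", 10L));
    }
}
```

Seeing the parameters side by side makes it easier to relate the setter methods to the API reference, which documents the request in terms of these names.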

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Using OAuth to authenticate users

OAuth is an open standard used to authenticate users to many different websites. A resource owner effectively delegates access to a server resource without having to share their credentials. It works over HTTPS. OAuth 2.0 succeeded OAuth 1.0 and is not backward compatible with it. It provides client developers with a simple way of providing authentication. Several companies use OAuth 2.0, including PayPal, Comcast, and Blizzard Entertainment.

A list of OAuth 2.0 providers is found at https://en.wikipedia.org/wiki/List_of_OAuth_providers. We will use several of these in our discussions.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limited in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access secret token. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON files.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your request.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs is found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that will narrow down the number of images returned from a query and includes such criteria as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the longitude and latitude for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments — "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
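
Because the four bounding box arguments are plain strings, it is easy to pass them in the wrong order. The following is a minimal sketch of a helper that checks the corners before they are handed to setBBox. The class name BoundingBoxSketch and the method toBBoxArgs are hypothetical and are not part of Flickr4Java:

```java
public class BoundingBoxSketch {

    /**
     * Validates the corners and returns them in the order used above:
     * minimum longitude, minimum latitude, maximum longitude, maximum latitude.
     */
    public static String[] toBBoxArgs(double minLon, double minLat,
                                      double maxLon, double maxLat) {
        if (minLon < -180 || maxLon > 180 || minLat < -90 || maxLat > 90) {
            throw new IllegalArgumentException("Coordinate out of range");
        }
        if (minLon > maxLon || minLat > maxLat) {
            throw new IllegalArgumentException("Minimum exceeds maximum");
        }
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)
        };
    }

    public static void main(String[] args) {
        // Hypothetical coordinates chosen only for illustration
        String[] box = toBBoxArgs(-74.3, 40.5, -73.7, 40.9);
        System.out.println(String.join(",", box));
    }
}
```

The four returned strings could then be passed to setBBox in order, for example as box[0] through box[3].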

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first parameter of the search method. The second parameter determines how many photos are retrieved per page and the third parameter is the page number. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface(); 
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0); 

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i + 
            "\nTitle: " + photo.getTitle() +  
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class's getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
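
The ImageIO class's write method returns false, rather than throwing an exception, when no suitable writer is found, so the return value is worth checking. The following sketch uses only the JDK and a blank placeholder image in place of one downloaded from Flickr. The class and method names are hypothetical, and the copy into an RGB image is an added precaution, since JPEG does not support transparency:

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class SaveImageSketch {

    /** Writes the image as a JPEG and fails loudly if no writer accepted it. */
    public static void saveAsJpeg(BufferedImage image, File target)
            throws IOException {
        // Copy into an RGB image because JPEG has no alpha channel
        BufferedImage rgb = new BufferedImage(
            image.getWidth(), image.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        g.drawImage(image, 0, 0, null);
        g.dispose();
        // ImageIO.write returns false when no appropriate writer is found
        if (!ImageIO.write(rgb, "jpg", target)) {
            throw new IOException("No JPEG writer for " + target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder standing in for the BufferedImage returned by getImage
        BufferedImage placeholder =
            new BufferedImage(64, 48, BufferedImage.TYPE_INT_ARGB);
        File outputfile = File.createTempFile("image", ".jpg");
        saveAsJpeg(placeholder, outputfile);
        System.out.println("Wrote " + outputfile.length() + " bytes");
    }
}
```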

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create an account in the Google Developer's Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search itself. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • Transport: Object used for HTTP
  • JSONFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name, and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a search request. The YouTube class's search method returns an object whose list method takes a string naming the parts of each search result to be returned. In this case, the string specifies the id and snippet portions of the search result:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

The search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/playlistItems#methods. In the previous sequence, only the id and snippet parts of a search will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 
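
The fields string passed to setFields earlier is simply a comma-separated list of paths wrapped in an items(...) selector. When the set of fields varies, it may be easier to assemble the string from a list. The following sketch uses only the JDK; the class name FieldsSketch and the method itemFields are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class FieldsSketch {

    /** Wraps the given item-level paths in the items(...) selector. */
    public static String itemFields(List<String> paths) {
        return "items(" + String.join(",", paths) + ")";
    }

    public static void main(String[] args) {
        // The same five paths used in the setFields call
        List<String> paths = Arrays.asList(
            "id/kind", "id/videoId", "snippet/title",
            "snippet/description", "snippet/thumbnails/default/url");
        System.out.println(itemFields(paths));
    }
}
```

The resulting string could then be passed to setFields in place of the hand-built literal.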

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

One possible output follows where parts of the output have been modified:

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 

There are other more sophisticated download techniques found at the GitHub site.

Handling Twitter

The sheer volume of data and the popularity of the site, among celebrities and the general public alike, make Twitter a valuable resource for mining social media data. Twitter is a popular social media platform allowing users to read and post short messages called tweets. Twitter provides API support for posting and pulling tweets, including streaming data from all public users. While there are services available for pulling the entire set of public tweet data, we are going to examine other options that, while limiting in the amount of data retrieved at one time, are available at no cost.

We are going to focus on the Twitter API for retrieving streaming data. There are other options for retrieving tweets from a specific user as well as posting data to a specific account but we will not be addressing those in this chapter. The public stream API, at the default access level, allows the user to pull a sample of public tweets currently streaming on Twitter. It is possible to refine the data by specifying parameters to track keywords, specific users, and location.

We are going to use HBC, a Java HTTP client, for this example. You can download a sample HBC application at https://github.com/twitter/hbc. If you prefer to use a different HTTP client, ensure it will return incremental response data. The Apache HTTP client is one option. Before you can create the HTTP connection, you must first create a Twitter account and an application within that account. To get started with the app, visit apps.twitter.com. Once your app is created, you will be assigned a consumer key, consumer secret, access token, and access token secret. We will also use OAuth, as discussed previously in this chapter.

First, we will write a method to perform the authentication and request data from Twitter. The parameters for our method are the authentication information given to us by Twitter when we created our app. We will create a BlockingQueue object to hold our streaming data. For this example, we will set a default capacity of 10,000. We will also specify our endpoint and turn off stall warnings:

    public static void streamTwitter( 
        String consumerKey, String consumerSecret,  
        String accessToken, String accessSecret)  
            throws InterruptedException { 
 
        BlockingQueue<String> statusQueue =  
            new LinkedBlockingQueue<String>(10000); 
        StatusesSampleEndpoint ending =  
            new StatusesSampleEndpoint(); 
        ending.stallWarnings(false); 
        ... 
    } 

Next, we create an Authentication object using OAuth1, a variation of the OAuth class. We can then build our connection client and complete the HTTP connection:

    Authentication twitterAuth = new OAuth1(consumerKey,  
        consumerSecret, accessToken, accessSecret); 
    BasicClient twitterClient = new ClientBuilder() 
            .name("Twitter client") 
            .hosts(Constants.STREAM_HOST) 
            .endpoint(ending) 
            .authentication(twitterAuth) 
            .processor(new StringDelimitedProcessor(statusQueue)) 
            .build(); 
    twitterClient.connect(); 

For the purposes of this example, we will simply read the messages received from the stream and print them to the screen. The messages are returned in JSON format and the decision of how to process them in a real application will depend upon the purpose and limitations of that application:

    for (int msgRead = 0; msgRead < 1000; msgRead++) { 
      if (twitterClient.isDone()) { 
        out.println(twitterClient.getExitEvent().getMessage()); 
        break; 
      } 
 
      String msg = statusQueue.poll(10, TimeUnit.SECONDS); 
      if (msg == null) { 
        out.println("Waited 10 seconds - no message received"); 
      } else { 
        out.println(msg); 
      } 
    } 
    twitterClient.stop(); 
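
The polling pattern used in this loop can be tried without a Twitter connection. The following sketch uses only the JDK: a short-lived thread stands in for the client that fills the queue, and the hypothetical drain method reads messages until the limit is reached. Unlike the loop above, which keeps waiting after a timeout, this sketch stops at the first poll that times out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueuePollSketch {

    /** Reads up to maxMessages, giving up on the first poll that times out. */
    public static List<String> drain(BlockingQueue<String> queue,
            int maxMessages, long timeoutMillis) throws InterruptedException {
        List<String> received = new ArrayList<>();
        while (received.size() < maxMessages) {
            String msg = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (msg == null) {
                break; // Nothing arrived within the timeout
            }
            received.add(msg);
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> statusQueue = new LinkedBlockingQueue<>(10000);
        // Stand-in producer; in the example above the client fills the queue
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                statusQueue.offer("{\"id\":" + i + "}");
            }
        });
        producer.start();
        producer.join();
        System.out.println(drain(statusQueue, 1000, 200));
    }
}
```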

To execute our method, we simply pass our authentication information to the streamTwitter method. For security purposes, we have replaced our personal keys here. Authentication information should always be protected:

    public static void main(String[] args) { 
   
      try { 
        SampleStreamExample.streamTwitter( 
            myKey, mySecret, myToken, myAccess);  
      } catch (InterruptedException e) { 
        out.println(e); 
      } 
    } 
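
One common way to keep keys out of source code is to read them from environment variables. The following is a minimal sketch of that idea using only the JDK. The variable names, such as TWITTER_CONSUMER_KEY, and the require helper are assumptions made for illustration and are not defined by Twitter or HBC:

```java
import java.util.Map;

public class CredentialsSketch {

    /** Returns the named setting or fails if it is missing or blank. */
    public static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing setting: " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        try {
            Map<String, String> env = System.getenv();
            // Hypothetical variable names; use whatever your deployment defines
            String myKey = require(env, "TWITTER_CONSUMER_KEY");
            String mySecret = require(env, "TWITTER_CONSUMER_SECRET");
            String myToken = require(env, "TWITTER_ACCESS_TOKEN");
            String myAccess = require(env, "TWITTER_ACCESS_SECRET");
            // The four values would be passed to streamTwitter at this point
            System.out.println("All four credentials found");
        } catch (IllegalStateException ex) {
            // Report which setting is missing without printing any secrets
            System.out.println(ex.getMessage());
        }
    }
}
```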

Here is truncated sample data retrieved using the methods listed above. Your data will vary based upon Twitter's live stream, but it should resemble this example:

{"created_at":"Fri May 20 15:47:21 +0000 2016","id":733685552789098496,"id_str":"733685552789098496","text":"bwisit si em bahala sya","source":"\u003ca href="http:\/\/twitter.com" rel="nofollow"\u003eTwitter Web 
...
ntions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"tl","timestamp_ms":"1463759241660"}

Twitter also provides support for pulling all data for one specific user account, as well as posting data directly to an account. A REST API is also available and provides support for specific queries via the search API. These also use the OAuth standard and return data in JSON format.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and image type information. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to programmatically access its contents.

MediaWiki is an open source wiki application that supports wiki type sites. It is used to support Wikipedia and many other sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application, using this API, can log in, read data, and post changes to a site.

There are several Java APIs that support programmatic access to a wiki site as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki we will use Bliki found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 
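
Since the MediaWiki API is accessed over HTTP, it can help to see what a content request to the api.php address might look like. The following sketch only builds the request URL and does not contact Wikipedia. It uses the API's action, prop, rvprop, format, and titles query parameters. It is independent of Bliki and is not a description of how Bliki forms its requests; the class and method names are hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class WikiQueryUrlSketch {

    /** Builds a query URL asking for the current content of the given pages. */
    public static String contentQueryUrl(String apiUrl, String... titles) {
        try {
            // Several titles are separated by a pipe character
            String joined =
                URLEncoder.encode(String.join("|", titles), "UTF-8");
            return apiUrl + "?action=query&prop=revisions&rvprop=content"
                + "&format=json&titles=" + joined;
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 should be available", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(contentQueryUrl(
            "https://en.wikipedia.org/w/api.php", "Data science"));
    }
}
```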

An account will enable us to modify the document. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

A partial listing of the output follows:

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List <Reference> referenceList = wikiModel.getReferences(); 
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

It may be desirable to set up your own Wikipedia server to handle your requests.

Handling Wikipedia

Wikipedia (https://www.wikipedia.org/) is a useful source of text and images. It is an Internet encyclopedia that hosts 38 million articles written in over 250 languages (https://en.wikipedia.org/wiki/Wikipedia). As such, it is useful to know how to access its contents programmatically.

MediaWiki is an open source wiki application that powers Wikipedia and many other wiki-style sites. The MediaWiki API (http://www.mediawiki.org/wiki/API) provides access to a wiki's data and metadata over HTTP. An application using this API can log in, read data, and post changes to a site.
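
Since the API is reached over plain HTTP, a read request is essentially a URL. The following is a minimal sketch, using only the JDK, of how such a URL might be assembled for the wikitext of a single page. The class and method names are our own, and the query parameters shown (action=query, prop=revisions, rvprop=content, format=json) are one common combination rather than the only option. The URL is built but not fetched:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of any wiki library.
final class MediaWikiQueryUrl {

    // Builds a read-only query URL asking for the current wikitext of one page.
    static String forTitle(String apiEndpoint, String title) {
        try {
            return apiEndpoint
                + "?action=query&prop=revisions&rvprop=content&format=json"
                + "&titles=" + URLEncoder.encode(title, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            forTitle("https://en.wikipedia.org/w/api.php", "Data science"));
    }
}
```

Encoding the title matters because page titles frequently contain spaces and punctuation that are not valid in a query string.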

There are several Java APIs that support programmatic access to a wiki site, as listed at https://www.mediawiki.org/wiki/API:Client_code#Java. To demonstrate Java access to a wiki, we will use Bliki, found at https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home. It provides good access and is easy to use for most basic operations.

The MediaWiki API is complex and has many features. The intent of this section is to illustrate the basic process of obtaining text from a Wikipedia article using this API. It is not possible to cover the API completely here.

We will use the following classes from the info.bliki.api and info.bliki.wiki.model packages:

  • Page: Represents a retrieved page
  • User: Represents a user
  • WikiModel: Represents the wiki

Javadocs for Bliki are found at http://www.javadoc.io/doc/info.bliki.wiki/bliki-core/3.1.0.

The following example has been adapted from http://www.integratingstuff.com/2012/04/06/hook-into-wikipedia-using-java-and-the-mediawiki-api/. This example will access the English Wikipedia page for the subject, data science. We start by creating an instance of the User class. The first two arguments of the three-argument constructor are the user ID and password, respectively. In this case, they are empty strings. This combination allows us to read a page without having to set up an account. The third argument is the URL for the MediaWiki API page:

    User user = new User("", "",  
        "http://en.wikipedia.org/w/api.php"); 
    user.login(); 

Logging in with an account would also allow us to modify pages. The queryContent method returns a list of Page objects for the subjects found in a string array. Each string should be the title of a page. In this example, we access a single page:

    String[] titles = {"Data science"}; 
    List<Page> pageList = user.queryContent(titles); 

Each Page object contains the content of a page. There are several methods that will return the contents of the page. For each page, a WikiModel instance is created using the two-argument constructor. The first argument is the image base URL and the second argument is the link base URL. These URLs use Wiki variables called image and title, which will be replaced when creating links:

    for (Page page : pageList) { 
        WikiModel wikiModel = new WikiModel("${image}",  
            "${title}"); 
        ... 
    } 
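
To make the role of the two placeholders concrete, here is a small sketch of the substitution idea using only the JDK. It is an illustration of the concept, not Bliki's actual implementation, and the class name, method name, and base URL are assumptions chosen for the example:

```java
// Hypothetical helper that mimics how a ${name} placeholder might be expanded.
final class LinkTemplate {

    // Replaces every occurrence of ${variable} in the template with the value.
    static String expand(String template, String variable, String value) {
        return template.replace("${" + variable + "}", value);
    }

    public static void main(String[] args) {
        // With a bare "${title}" base, the link is just the page title
        System.out.println(expand("${title}", "title", "Knowledge"));
        // With a full base URL, the link points at the hosting wiki
        System.out.println(expand(
            "https://en.wikipedia.org/wiki/${title}", "title", "Knowledge"));
    }
}
```

This suggests why the links in the rendered HTML are relative when only "${title}" is supplied: nothing precedes the variable, so only the title itself ends up in the href attribute.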

The render method will take the wiki page and render it to HTML. There is also a method to render the page to a PDF document:

    String htmlText = wikiModel.render(page.toString()); 

The HTML text is then displayed:

    out.println(htmlText); 

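A partial listing of the output follows. Because the argument to render is page.toString(), the listing begins with the page's metadata (page ID, namespace, and title) before the rendered article text: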

<p>PageID: 35458904; NS: 0; Title: Data science; 
Image url: 
Content:
{{distinguish}}
{{Use dmy dates}}
{{Data Visualization}}</p>
<p><b>Data science</b> is an interdisciplinary field about processes and systems to extract <a href="Knowledge" >knowledge</a> 
...

We can also obtain basic information about the article using one of several methods as shown here:

    out.println("Title: " + page.getTitle() + "\n" + 
        "Page ID: " + page.getPageid() + "\n" + 
        "Timestamp: " + page.getCurrentRevision().getTimestamp()); 

It is also possible to obtain a list of references in the article and a list of the headers. Here, a list of the references is displayed:

    List<Reference> referenceList = wikiModel.getReferences();
    out.println(referenceList.size()); 
    for(Reference reference : referenceList) { 
        out.println(reference.getRefString()); 
    } 

The following illustrates the process of getting the section headers:

    ITableOfContent toc = wikiModel.getTableOfContent(); 
    List<SectionHeader> sections = toc.getSectionHeaders(); 
    for(SectionHeader sh : sections) { 
        out.println(sh.getFirst()); 
    } 

The entire content of Wikipedia can be downloaded. This process is discussed at https://en.wikipedia.org/wiki/Wikipedia:Database_download.

If you need to issue a large number of requests, it may be desirable to set up your own wiki server from this download to handle them.

Handling Flickr

Flickr (https://www.flickr.com/) is an online photo management and sharing application. It is a possible source for images and videos. The Flickr Developer Guide (https://www.flickr.com/services/developer/) is a good starting point to learn more about Flickr's API.

One of the first steps to using the Flickr API is to request an API key. This key is used to sign your API requests. The process to obtain a key starts at https://www.flickr.com/services/apps/create/. Both commercial and noncommercial keys are available. When you obtain a key you will also get a "secret." Both of these are required to use the API.

We will illustrate the process of locating and downloading images from Flickr. The process involves:

  • Creating a Flickr class instance
  • Specifying the search parameters for a query
  • Performing the search
  • Downloading the image

A FlickrException or IOException may be thrown during this process. There are several APIs that support Flickr access. We will be using Flickr4Java, found at https://github.com/callmeal/Flickr4Java. The Flickr4Java Javadocs are found at http://flickrj.sourceforge.net/api/. We will start with a try block and the apikey and secret declarations:

    try { 
        String apikey = "Your API key"; 
        String secret = "Your secret"; 
 
    } catch (FlickrException | IOException ex) { 
        // Handle exceptions 
    } 

The Flickr instance is created next, where the apikey and secret are supplied as the first two parameters. The last parameter specifies the transfer technique used to access Flickr servers. Currently, the REST transport is supported using the REST class:

    Flickr flickr = new Flickr(apikey, secret, new REST()); 

To search for images, we will use the SearchParameters class. This class supports a number of criteria that narrow down the images returned from a query, such as latitude, longitude, media type, and user ID. In the following sequence, the setBBox method specifies the bounding box for the search. The parameters are (in order): minimum longitude, minimum latitude, maximum longitude, and maximum latitude. The setMedia method specifies the type of media. There are three possible arguments: "all", "photos", and "videos":

    SearchParameters searchParameters = new SearchParameters(); 
    searchParameters.setBBox("-180", "-90", "180", "90"); 
    searchParameters.setMedia("photos"); 
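
Since the four bounding-box arguments are plain strings, it is easy to pass them in the wrong order. The following sketch, which uses only the JDK, checks the values before converting them to strings. The BoundingBox class and its toArgs method are hypothetical helpers written for this illustration and are not part of Flickr4Java:

```java
// Hypothetical helper that validates a bounding box before it is used in a search.
final class BoundingBox {

    // Returns the values in the order: min longitude, min latitude,
    // max longitude, max latitude. NaN values fail the range check.
    static String[] toArgs(double minLon, double minLat,
                           double maxLon, double maxLat) {
        if (!(minLon >= -180 && maxLon <= 180
                && minLat >= -90 && maxLat <= 90)) {
            throw new IllegalArgumentException("Coordinates out of range");
        }
        if (!(minLon <= maxLon && minLat <= maxLat)) {
            throw new IllegalArgumentException("Minimum exceeds maximum");
        }
        return new String[] {
            String.valueOf(minLon), String.valueOf(minLat),
            String.valueOf(maxLon), String.valueOf(maxLat)
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", toArgs(-180, -90, 180, 90)));
    }
}
```

The four elements of the returned array could then be passed, in order, to the setBBox method used above.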

The PhotosInterface class possesses a search method that uses the SearchParameters instance to retrieve a list of photos. The Flickr class's getPhotosInterface method returns an instance of the PhotosInterface class, as shown next. The SearchParameters instance is the first argument of the search method. The second argument determines how many photos are retrieved per page, and the third selects which page of results to return. A PhotoList class instance is returned:

    PhotosInterface pi = flickr.getPhotosInterface();
    PhotoList<Photo> list = pi.search(searchParameters, 10, 0);

The next sequence illustrates the use of several methods to get information about the images retrieved. Each Photo instance is accessed using the get method. The title, image format, public flag, and photo URL are displayed:

    out.println("Image List"); 
    for (int i = 0; i < list.size(); i++) { 
        Photo photo = list.get(i); 
        out.println("Image: " + i +
            "\nTitle: " + photo.getTitle() +
            "\nMedia: " + photo.getOriginalFormat() + 
            "\nPublic: " + photo.isPublicFlag() + 
            "\nUrl: " + photo.getUrl() + 
            "\n"); 
    } 
    out.println(); 

A partial listing is shown here where many of the specific values have been modified to protect the original data:

Image List
Image: 0
Title: XYZ Image
Media: jpg
Public: true
Url: https://flickr.com/photos/7723...@N02/269...
Image: 1
Title: IMG_5555.jpg
Media: jpg
Public: true
Url: https://flickr.com/photos/2665...@N07/264...
Image: 2
Title: DSC05555
Media: jpg
Public: true
Url: https://flickr.com/photos/1179...@N04/264...

The list of images returned by this example will vary since we used a fairly wide search range and images are being added all of the time.

There are two approaches that we can use to download an image. The first uses the image's URL and the second uses a Photo object. The image's URL can be obtained from a number of sources. We use the Photo class getUrl method for this example.

In the following sequence, we obtain an instance of PhotosInterface using its constructor to illustrate an alternate approach:

    PhotosInterface pi = new PhotosInterface(apikey, secret,  
        new REST()); 

We get the first Photo instance from the previous list and then call its getUrl method to get the image's URL. The PhotosInterface class's getImage method returns a BufferedImage object representing the image, as shown here:

    Photo currentPhoto = list.get(0);  
    BufferedImage bufferedImage =  
        pi.getImage(currentPhoto.getUrl()); 

The image is then saved to a file using the ImageIO class:

    File outputfile = new File("image.jpg"); 
    ImageIO.write(bufferedImage, "jpg", outputfile); 
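
The write method returns false, rather than throwing an exception, when no writer is available for the requested format, so the result is worth checking. The following JDK-only sketch wraps the call with that check. The ImageSaver class is a hypothetical helper, and the null check reflects our assumption that a download might not always yield an image:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper that turns a silent write failure into an exception.
final class ImageSaver {

    static void save(BufferedImage image, String format, File target)
            throws IOException {
        if (image == null) {
            throw new IOException("No image data to save");
        }
        // ImageIO.write returns false when no suitable writer is found
        if (!ImageIO.write(image, format, target)) {
            throw new IOException("No ImageIO writer for format: " + format);
        }
    }

    public static void main(String[] args) throws IOException {
        // A blank 2x2 image stands in for a downloaded photo
        BufferedImage sample =
            new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("sample", ".jpg");
        save(sample, "jpg", target);
        System.out.println("Saved " + target.length() + " bytes");
    }
}
```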

The getImage method is overloaded. Here, the Photo instance and the size of the image desired are used as arguments to get the BufferedImage instance:

    bufferedImage = pi.getImage(currentPhoto, Size.SMALL); 

The image can be saved to a file using the previous technique.

The Flickr4Java API supports a number of other techniques for working with Flickr images.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create a project in the Google Developers Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search itself. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The builder's setApplicationName method gives the application a name, and the build method creates a new YouTube instance:

    try { 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name")
                .build();
        ...
    } catch (GoogleJsonResponseException ex) {
        // Handle exceptions 
    } catch (IOException ex) { 
        // Handle exceptions 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a search request. The YouTube class's search method returns the object used to create it, and the argument to the list method specifies which parts of each search result are to be returned. In this case, the string requests the id and snippet parts:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

Each search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/search. In the previous sequence, only the id and snippet parts of a search result will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 
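
Because the resource type is passed as a free-form string, a typo would only surface when the request is made. One way to guard against this, sketched here with only the JDK, is a small enum covering the three options named above. The SearchType enum and its apiValue method are our own illustration, not part of the YouTube client library:

```java
import java.util.Locale;

// Hypothetical enum covering the three resource types named in the text.
enum SearchType {
    VIDEO, CHANNEL, PLAYLIST;

    // Lowercase form of the constant name, for example "video"
    String apiValue() {
        return name().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (SearchType type : values()) {
            System.out.println(type.apiValue());
        }
    }
}
```

The call in the previous sequence could then be written with SearchType.VIDEO.apiValue() as the argument to setType.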

In addition, we further specify the fields to be returned as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 
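
The fields string passed to setFields earlier is simply a comma-separated list of paths wrapped in items(...). When the list of fields grows, it can be easier to assemble it from individual paths. The following JDK-only sketch shows one way to do this. The FieldsSelector class is a hypothetical helper, not part of the YouTube client library:

```java
// Hypothetical helper that assembles a fields selector from individual paths.
final class FieldsSelector {

    // Joins the paths with commas and wraps them in items(...)
    static String items(String... paths) {
        return "items(" + String.join(",", paths) + ")";
    }

    public static void main(String[] args) {
        System.out.println(items(
            "id/kind",
            "id/videoId",
            "snippet/title",
            "snippet/description",
            "snippet/thumbnails/default/url"));
    }
}
```

Called with the five paths shown in main, the helper produces the same string that was passed to setFields above.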

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about the video. The SearchResult video variable allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

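One possible output follows, where parts of the output have been modified. Kind is null because the fields filter requested id/kind but not the top-level kind field that the getKind method reads: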

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.
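
As one example of such a check, retrieving the first element with iterator().next() fails if the list is empty, and we assume here that the list may also be null when a query matches nothing. The following JDK-only sketch returns an Optional instead. The Results class and its first method are hypothetical helpers written for this illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for safely taking the first element of a result list.
final class Results {

    // Returns an empty Optional for a null list, an empty list,
    // or a list whose first element is null.
    static <T> Optional<T> first(List<T> items) {
        if (items == null || items.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(0));
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Funny Cats", "More Cats");
        // Only act when a result is actually present
        first(titles).ifPresent(title -> System.out.println("Title: " + title));
        System.out.println(first(null).isPresent());
    }
}
```

The same pattern could be applied to the nested values, such as the snippet and its thumbnails, before they are dereferenced.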

If we need to download the video, one of the simplest ways is to use axet/wget found at https://github.com/axet/wget. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 
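
The placeholder videoID in the URL above must be replaced with a real ID, such as one returned by the earlier search. The following JDK-only sketch builds the URL string from an ID. The WatchUrl class is a hypothetical helper, and it reuses the URL prefix shown in the preceding example:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a watch URL from a video ID.
final class WatchUrl {

    static String forVideoId(String videoId) {
        if (videoId == null || videoId.isEmpty()) {
            throw new IllegalArgumentException("A video ID is required");
        }
        try {
            // Encoding guards against characters that are invalid in a query
            return "http://www.youtube.com/watch?v="
                + URLEncoder.encode(videoId, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every Java platform
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // "abc123" is a made-up ID used only for demonstration
        System.out.println(forVideoId("abc123"));
    }
}
```

The resulting string could be passed to the URL constructor in place of the hard-coded value.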

There are other more sophisticated download techniques found at the GitHub site.

Handling YouTube

YouTube is a popular video site where users can upload and share videos (https://www.youtube.com/). It has been used to share humorous videos, provide instructions on how to do any number of things, and share information among its viewers. It is a useful source of information as it captures the thoughts and ideas of a diverse group of people. This provides an interesting opportunity to analyze and gain insight into human behavior.

YouTube can serve as a useful source of videos and video metadata. A Java API is available to access its contents (https://developers.google.com/youtube/v3/). Detailed documentation of the API is found at https://developers.google.com/youtube/v3/docs/.

In this section, we will demonstrate how to search for videos by keyword and retrieve information of interest. We will also show how to download a video. To use the YouTube API, you will need a Google account, which can be obtained at https://www.google.com/accounts/NewAccount. Next, create a project in the Google Developers Console (https://console.developers.google.com/). API access is supported using either API keys or OAuth 2.0 credentials. The project creation process and keys are discussed at https://developers.google.com/youtube/registering_an_application#create_project.

Searching by keyword

The process of searching for videos by keyword is adapted from https://developers.google.com/youtube/v3/code_samples/java#search_by_keyword. Other potentially useful code examples can be found at https://developers.google.com/youtube/v3/code_samples/java. The process has been simplified so that we can focus on the search itself. We start with a try block and the creation of a YouTube instance. This class provides the basic access to the API. Javadocs for this API are found at https://developers.google.com/resources/api-libraries/documentation/youtube/v3/java/latest/.

The YouTube.Builder class is used to construct a YouTube instance. Its constructor takes three arguments:

  • HttpTransport: Object used for HTTP
  • JsonFactory: Used to process JSON objects
  • HttpRequestInitializer: None is needed for this example

Many of the API's responses will be in the form of JSON objects. The YouTube.Builder class's setApplicationName method gives the application a name, and the build method creates a new YouTube instance:

    try { 
        // Build the client; the empty initializer leaves requests unchanged 
        YouTube youtube = new YouTube.Builder( 
            Auth.HTTP_TRANSPORT, 
            Auth.JSON_FACTORY, 
            new HttpRequestInitializer() { 
                @Override 
                public void initialize(HttpRequest request)  
                        throws IOException { 
                } 
            }) 
                .setApplicationName("application_name") 
                .build(); 
        ... 
    } catch (GoogleJsonResponseException ex) { 
        // Handle errors reported by the API 
    } catch (IOException ex) { 
        // Handle network and other I/O errors 
    } 

Next, we initialize a string to hold the search term of interest. In this case, we will look for videos containing the word cats:

    String queryTerm = "cats"; 

The YouTube.Search.List class represents a search request. The YouTube class's search method provides access to the search operations, and its list method creates the request. The string argument specifies which parts of each search result are to be returned, in this case the id and snippet parts:

    YouTube.Search.List search = youtube 
        .search() 
        .list("id,snippet"); 

Each search result is a JSON object that has the following structure. It is described in more detail at https://developers.google.com/youtube/v3/docs/search. With the previous request, only the id and snippet parts of each search result will be returned, resulting in a more efficient operation:

{ 
  "kind": "youtube#searchResult", 
  "etag": etag, 
  "id": { 
    "kind": string, 
    "videoId": string, 
    "channelId": string, 
    "playlistId": string 
  }, 
  "snippet": { 
    "publishedAt": datetime, 
    "channelId": string, 
    "title": string, 
    "description": string, 
    "thumbnails": { 
      (key): { 
        "url": string, 
        "width": unsigned integer, 
        "height": unsigned integer 
      } 
    }, 
    "channelTitle": string, 
    "liveBroadcastContent": string 
  } 
} 

Next, we need to specify the API key and various search parameters. The query term is specified, as well as the type of resource to be returned. In this case, only videos will be returned. The other two options are channel and playlist:

    String apiKey = "Your API key"; 
    search.setKey(apiKey); 
    search.setQ(queryTerm); 
    search.setType("video"); 

In addition, we narrow the response to specific fields, as shown here. These correspond to fields of the JSON object:

    search.setFields("items(id/kind,id/videoId,snippet/title," +  
        "snippet/description,snippet/thumbnails/default/url)"); 

We also specify the maximum number of results to retrieve using the setMaxResults method:

    search.setMaxResults(10L); 
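
To make the role of each setter more concrete, the following is a minimal sketch of the query string that these parameters are assumed to map to when the request is sent to the REST endpoint of the YouTube Data API. The SearchUrlSketch class and its build method are hypothetical helpers written for illustration; they are not part of the client library, and the library's actual request construction may differ in detail:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of the YouTube client library) that shows
// how the search parameters could appear in a plain HTTP request.
final class SearchUrlSketch {

    // REST endpoint for search.list in the YouTube Data API v3
    static final String ENDPOINT =
        "https://www.googleapis.com/youtube/v3/search";

    static String build(String apiKey, String query, long maxResults) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("part", "id,snippet");        // list("id,snippet")
        params.put("q", query);                  // setQ
        params.put("type", "video");             // setType
        params.put("maxResults", Long.toString(maxResults)); // setMaxResults
        params.put("fields", "items(id/kind,id/videoId,snippet/title,"
            + "snippet/description,snippet/thumbnails/default/url)");
        params.put("key", apiKey);               // setKey

        // Join name=value pairs with '&' after the endpoint and '?'
        StringJoiner url = new StringJoiner("&", ENDPOINT + "?", "");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.add(entry.getKey() + "="
                + URLEncoder.encode(entry.getValue(),
                    StandardCharsets.UTF_8));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(build("YOUR_API_KEY", "cats", 10L));
    }
}
```

Seeing the parameters side by side makes it easier to compare a request built with the client library against the REST documentation for the search operation.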

The execute method will perform the actual query, returning a SearchListResponse object. Its getItems method returns a list of SearchResult objects, one for each video retrieved:

    SearchListResponse searchResponse = search.execute(); 
    List<SearchResult> searchResultList =  
        searchResponse.getItems(); 

In this example, we do not iterate through each video returned. Instead, we retrieve the first video and display information about it. The SearchResult variable, video, allows us to access different parts of the JSON object, as shown here:

    SearchResult video = searchResultList.iterator().next(); 
    Thumbnail thumbnail = video 
        .getSnippet().getThumbnails().getDefault(); 
 
    out.println("Kind: " + video.getKind()); 
    out.println("Video Id: " + video.getId().getVideoId()); 
    out.println("Title: " + video.getSnippet().getTitle()); 
    out.println("Description: " +  
        video.getSnippet().getDescription()); 
    out.println("Thumbnail: " + thumbnail.getUrl()); 

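One possible output follows, where parts of the output have been abbreviated. Kind is null because the setFields call requested id/kind rather than the top-level kind field, which is the one the getKind method returns: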

Kind: null
Video Id: tntO...
Title: Funny Cats ...
Description: Check out the ...
Thumbnail: https://i.ytimg.com/vi/tntO.../default.jpg

We have skipped many error checking steps to simplify the example, but these should be considered when implementing this in a business application.
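
One of the skipped checks concerns the result list itself: calling iterator().next() on an empty list throws a NoSuchElementException, and the list may also be null if the response carries no items. The following is a minimal sketch of such a guard. The ResultGuard class and its first method are hypothetical names introduced for illustration, and the sketch uses plain strings in place of SearchResult objects so that it runs without the client library:

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper; ResultGuard is our own name, not part of the API.
final class ResultGuard {

    // Returns the first element, or Optional.empty() when the list is
    // null, is empty, or starts with a null element.
    static <T> Optional<T> first(List<T> items) {
        if (items == null || items.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(items.get(0));
    }

    public static void main(String[] args) {
        // Stand-ins for the list returned by getItems
        List<String> noResults = Collections.emptyList();
        List<String> someResults = List.of("first video", "second video");

        System.out.println(first(noResults).orElse("No videos found"));
        System.out.println(first(someResults).orElse("No videos found"));
    }
}
```

With the real API, the same method could be applied to searchResultList before any of the getSnippet or getId calls are made.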

If we need to download the video, one of the simplest ways is to use the axet/vget library found at https://github.com/axet/vget, which supplies the VGet class used next. It provides an easy-to-use technique to download the video using its video ID.

In the following example, a URL is created using the video ID. You will need to provide a video ID for this to work properly. The file is saved to the current directory with the video's title as the filename:

    String url = "http://www.youtube.com/watch?v=videoID"; 
    String path = "."; 
    VGet vget = new VGet(new URL(url), new File(path)); 
    vget.download(); 
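
Since the video ID is available from the search result, the URL can be assembled rather than typed by hand. The following is a minimal sketch of that step. The WatchUrlSketch class and its watchUrl method are hypothetical helpers, and the use of https in place of http is an assumption on our part:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the YouTube API or the download library.
final class WatchUrlSketch {

    // Builds a watch URL from a video ID, rejecting missing IDs early
    static String watchUrl(String videoId) {
        if (videoId == null || videoId.trim().isEmpty()) {
            throw new IllegalArgumentException("videoId must not be empty");
        }
        // Assumption: https is used here; the original example uses http
        return "https://www.youtube.com/watch?v="
            + URLEncoder.encode(videoId.trim(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "videoID" is a placeholder, as in the example above
        System.out.println(watchUrl("videoID"));
    }
}
```

The resulting string can be passed to the URL constructor in place of the hard-coded value.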

More sophisticated download techniques are described at the library's GitHub site.

Summary

In this chapter, we discussed types of data that are useful for data science and readily accessible on the Internet. This discussion included details about file specifications and formats for the most common types of data sources.

We also examined Java APIs and other techniques for retrieving data, and illustrated this process with multiple sources. In particular, we focused on types of text-based document formats and multimedia files. We used web crawlers to access websites and then performed web scraping to retrieve data from the sites we encountered.

Finally, we extracted data from social media sites and examined the available Java support. We retrieved data from Twitter, Wikipedia, Flickr, and YouTube and examined the available API support.
