Chapter 3. Data Cleaning
Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.
We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is quality data. Data quality involves:
- Validity: Ensuring that the data possesses the correct form or structure
- Accuracy: The values within the data are truly representative of the dataset
- Completeness: There are no missing elements
- Consistency: Changes to data are in sync
- Uniformity: The same units of measurement are used
There are several techniques and tools used to clean data. We will examine the following approaches:
- Handling different types of data
- Cleaning and manipulating text data
- Filling in missing data
- Validating data
In addition, we will briefly examine several image enhancement techniques.
There are often many ways to accomplish the same cleaning task. For example, there are a number of GUI tools that support data cleaning, such as OpenRefine (http://openrefine.org/). This tool allows a user to read in a dataset and clean it using a variety of techniques. However, it requires a user to interact with the application for each dataset that needs to be cleaned. It is not conducive to automation.
We will focus on how to clean data using Java code. Even then, there may be different techniques to clean the data. We will show multiple approaches to provide the reader with insights on how it can be done. Sometimes, this will use core Java string classes, and at other times, it may use specialized libraries.
These libraries are often more expressive and efficient. However, there are times when using a simple string function is more than adequate to address the problem. Showing complementary techniques will improve the reader's skill set.
The basic text-based tasks include:
- Data transformation
- Data imputation (handling missing data)
- Subsetting data
- Sorting data
- Validating data
In this chapter, we are interested in cleaning data. However, part of this process is extracting information from various data sources. The data may be stored in plaintext or in binary form. We need to understand the various formats used to store data before we can begin the cleaning process. Many of these formats were introduced in Chapter 2, Data Acquisition, but we will go into greater detail in the following sections.
Handling data formats
Data comes in all types of forms. We will examine the more commonly used formats and show how they can be extracted from various data sources. Before we can clean data it needs to be extracted from a data source such as a file. In this section, we will build upon the introduction to data formats found in Chapter 2, Data Acquisition, and show how to extract all or part of a dataset. For example, from an HTML page we may want to extract only the text without markup. Or perhaps we are only interested in its figures.
These data formats can be quite complex. The intent of this section is to illustrate the basic techniques commonly used with that data format. Full treatment of a specific data format is beyond the scope of this book. Specifically, we will introduce how the following data formats can be processed from Java:
- CSV data
- Spreadsheets
- Portable Document Format, or PDF files
- JavaScript Object Notation, or JSON files
There are many other file types not addressed here. For example, jsoup is useful for parsing HTML documents. Since we introduced how this is done in the Web scraping in Java section of Chapter 2, Data Acquisition, we will not duplicate the effort here.
Handling CSV data
A common technique for separating information is to use commas or similar separators. Knowing how to work with CSV data allows us to utilize this type of data in our analysis efforts. When we deal with CSV data there are several issues including escaped data and embedded commas.
We will examine a few basic techniques for processing comma-separated data. Due to the row-column structure of CSV data, these techniques will read data from a file and place the data in a two-dimensional array. First, we will use a combination of the Scanner class to read in tokens and the String class split method to separate the data and store it in the array. Next, we will explore using the third-party library, OpenCSV, which offers a more efficient technique.
However, the first approach may only be appropriate for quick and dirty processing of data. We will discuss each of these techniques since they are useful in different situations.
We will use a dataset downloaded from https://www.data.gov/ containing U.S. demographic statistics sorted by ZIP code. This dataset can be downloaded at https://catalog.data.gov/dataset/demographic-statistics-by-zip-code-acfc9. For our purposes, this dataset has been stored in the file Demographics.csv. In this particular file, every row contains the same number of columns. However, not all data will be this clean, and the solutions shown next take into account the possibility of jagged arrays.
Note
A jagged array is an array where the number of columns may be different for different rows. For example, row 2 may have 5 elements while row 3 may have 6 elements. When using jagged arrays you have to be careful with your column indexes.
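As a minimal sketch of the care that column indexes require, the following guards every access against rows that are shorter than expected. The class name, the cellOrDefault helper, and the sample rows are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

class JaggedArraySketch {
    // Returns the cell at (row, col), or the fallback when that position does not exist.
    static String cellOrDefault(String[][] rows, int row, int col, String fallback) {
        if (row < 0 || row >= rows.length) {
            return fallback;
        }
        String[] r = rows[row];
        return (r != null && col >= 0 && col < r.length) ? r[col] : fallback;
    }

    public static void main(String[] args) {
        // Hypothetical data: the first row has 5 elements, the second has 6.
        String[][] rows = {
            {"a", "b", "c", "d", "e"},
            {"f", "g", "h", "i", "j", "k"}
        };
        for (String[] row : rows) {
            System.out.println(row.length + " columns: " + Arrays.toString(row));
        }
        // Column index 5 exists only in the second row.
        System.out.println(cellOrDefault(rows, 0, 5, "(missing)"));
        System.out.println(cellOrDefault(rows, 1, 5, "(missing)"));
    }
}
```

Looping with each row's own length, rather than a fixed column count, avoids an ArrayIndexOutOfBoundsException on short rows.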
First, we use the Scanner class to read in data from our data file. We will temporarily store the data in an ArrayList since we will not always know how many rows our data contains. The list is declared before the try block so that it remains available afterwards.
ArrayList<String> list = new ArrayList<String>();
try (Scanner csvData = new Scanner(new File("Demographics.csv"))) {
    while (csvData.hasNextLine()) {
        list.add(csvData.nextLine());
    }
} catch (FileNotFoundException ex) {
    // Handle exceptions
}
The list is converted to an array using the toArray method. This version of the method uses a String array as an argument so that the method will know what type of array to create. A two-dimensional array is then created to hold the CSV data.
String[] tempArray = list.toArray(new String[0]);
String[][] csvArray = new String[tempArray.length][];
The split method is used to create an array of Strings for each row. This array is assigned to a row of the csvArray.
for (int i = 0; i < tempArray.length; i++) {
    csvArray[i] = tempArray[i].split(",");
}
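The following sketch shows two limits of splitting on commas. The sample lines and the class name are made up for illustration. As a variation on the code above, it passes a limit of -1 to split so that trailing empty fields are kept; without that argument, split silently drops them.

```java
import java.util.Arrays;
import java.util.List;

class CsvSplitSketch {
    // Splits each line on commas; the -1 limit keeps trailing empty fields.
    static String[][] toJaggedArray(List<String> lines) {
        String[][] rows = new String[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            rows[i] = lines.get(i).split(",", -1);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from Demographics.csv.
        List<String> lines = Arrays.asList(
            "zip,count,note",
            "10001,44,",
            "10002,35,\"small, urban\"");
        for (String[] row : toJaggedArray(lines)) {
            System.out.println(row.length + " fields: " + Arrays.toString(row));
        }
    }
}
```

The third row comes back with four fields rather than three because the quoted value contains a comma. This is the embedded-comma problem that a simple split cannot handle.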
Our next technique will use a third-party library to read in and process CSV data. There are multiple options available, but we will focus on the popular OpenCSV (http://opencsv.sourceforge.net). This library offers several advantages over our previous technique. We can have an arbitrary number of items on each row without worrying about handling exceptions. We also do not need to worry about embedded commas or embedded carriage returns within the data tokens. The library also allows us to choose between reading the entire file at once or using an iterator to process data line-by-line.
First, we need to create an instance of the CSVReader class. Notice the second parameter allows us to specify the delimiter, a useful feature if we have a similar file format delimited by tabs or dashes, for example. If we want to read the entire file at one time, we use the readAll method, which returns a list containing one String array per row.
CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
List<String[]> holdData = dataReader.readAll();
Because each row has already been split into its tokens, the list can be copied directly into a two-dimensional array if one is needed. Alternatively, we can process the data one line at a time. In the example that follows, each token is printed out individually, but the tokens can also be stored in a two-dimensional array or other data structure as appropriate.
CSVReader dataReader = new CSVReader(new FileReader("Demographics.csv"), ',');
String[] nextLine;
while ((nextLine = dataReader.readNext()) != null) {
    for (String token : nextLine) {
        out.println(token);
    }
}
dataReader.close();
We can now clean or otherwise process the array.
Handling spreadsheets
Spreadsheets have proven to be a very popular tool for processing numeric and textual data. Due to the wealth of information that has been stored in spreadsheets over the past decades, knowing how to extract information from spreadsheets enables us to take advantage of this widely available data source. In this section, we will demonstrate how this is accomplished using the Apache POI API.
Open Office also supports a spreadsheet application. Open Office documents are stored in XML format which makes it readily accessible using XML parsing technologies. However, the Apache ODF Toolkit (http://incubator.apache.org/odftoolkit/) provides a means of accessing data within a document without knowing the format of the OpenOffice document. This is currently an incubator project and is not fully mature. There are a number of other APIs that can assist in processing OpenOffice documents as detailed on the Open Document Format (ODF) for developers (http://www.opendocumentformat.org/developers/) page.
Handling Excel spreadsheets
Apache POI (http://poi.apache.org/index.html) is a set of APIs providing access to many Microsoft products including Excel and Word. It consists of a series of components, each designed to access a specific Microsoft product. An overview of these components is found at http://poi.apache.org/overview.html.
In this section we will demonstrate how to read a simple Excel spreadsheet using the XSSF component to access Excel 2007+ spreadsheets. The Javadocs for the Apache POI API are found at https://poi.apache.org/apidocs/index.html.
We will use a simple Excel spreadsheet consisting of a series of rows containing an ID along with minimum, maximum, and average values. These numbers are not intended to represent any specific type of data. The spreadsheet follows:
ID | Minimum | Maximum | Average
12345 | 45 | 89 | 65.55
23456 | 78 | 96 | 86.75
34567 | 56 | 89 | 67.44
45678 | 86 | 99 | 95.67
We start with a try-with-resources block to handle any IOExceptions that may occur:
try (FileInputStream file = new FileInputStream(new File("Sample.xlsx"))) {
    ...
} catch (IOException e) {
    // Handle exceptions
}
An instance of the XSSFWorkbook class is created using the spreadsheet. Since a workbook may consist of multiple spreadsheets, we select the first one using the getSheetAt method.
XSSFWorkbook workbook = new XSSFWorkbook(file);
XSSFSheet sheet = workbook.getSheetAt(0);
The next step is to iterate through the rows, and then each column, of the spreadsheet:
for (Row row : sheet) {
    for (Cell cell : row) {
        ...
    }
    out.println();
}
Each cell of the spreadsheet may use a different format. We use the getCellType method to determine its type and then use the appropriate method to extract the data in the cell. In this example we are only working with numeric and text data. Note that newer POI releases represent cell types with a CellType enum rather than these integer constants, so check the Javadocs for the version you are using.
switch (cell.getCellType()) {
    case Cell.CELL_TYPE_NUMERIC:
        out.print(cell.getNumericCellValue() + "\t");
        break;
    case Cell.CELL_TYPE_STRING:
        out.print(cell.getStringCellValue() + "\t");
        break;
}
When executed we get the following output:
ID Minimum Maximum Average
12345.0 45.0 89.0 65.55
23456.0 78.0 96.0 86.75
34567.0 56.0 89.0 67.44
45678.0 86.0 99.0 95.67
POI supports other more sophisticated classes and methods to extract data.
Handling PDF files
There are several APIs supporting the extraction of text from a PDF file. Here we will use PDFBox. Apache PDFBox (https://pdfbox.apache.org/) is an open source API that allows Java programmers to work with PDF documents. In this section we will illustrate how to extract simple text from a PDF document. Javadocs for the PDFBox API are found at https://pdfbox.apache.org/docs/2.0.1/javadocs/.
The sample PDF file used in this example contains the following text:
This is a simple PDF file. It consists of several bullets:
- Line 1
- Line 2
- Line 3
This is the end of the document.
A try-with-resources block is used to catch IOExceptions and to close the document once processing is finished. The PDDocument class will represent the PDF document being processed. Its load method will load in the PDF file specified by the File object:
try (PDDocument document = PDDocument.load(new File("PDF File.pdf"))) {
    ...
} catch (IOException e) {
    // Handle exceptions
}
Once loaded, the PDFTextStripper class getText method will extract the text from the file. The text is then displayed as shown here:
PDFTextStripper Tstripper = new PDFTextStripper();
String documentText = Tstripper.getText(document);
System.out.println(documentText);
The output of this example follows. Notice that the bullets are returned as question marks.
This is a simple PDF file. It consists of several bullets:
? Line 1
? Line 2
? Line 3
This is the end of the document.
This is a brief introduction to the use of PDFBox. It is a very powerful tool when we need to extract and otherwise manipulate PDF documents.
Handling JSON
In Chapter 2, Data Acquisition, we learned that certain YouTube searches return JSON formatted results. Specifically, the SearchResult class holds information relating to a specific search. In that section we illustrated how to use YouTube-specific techniques to extract information. In this section we will illustrate how to extract JSON information using the Jackson JSON implementation.
Jackson supports three models for processing JSON data:
- Streaming API - JSON data is processed token by token
- Tree model - The JSON data is held entirely in memory and then processed
- Data binding - The JSON data is transformed to a Java object
Using JSON streaming API
We will illustrate the first two approaches. The first approach is more efficient and is used when a large amount of data is processed. The second technique is convenient, but the data must not be too large. The third technique is useful when it is more convenient to use specific Java classes to process data. For example, if the JSON data represents an address, then a specific Java address class can be defined to hold and process the data.
There are several Java libraries that support JSON processing including:
- Flexjson (http://flexjson.sourceforge.net/)
- Genson (http://owlike.github.io/genson/)
- Google-Gson (https://github.com/google/gson)
- Jackson library (https://github.com/FasterXML/jackson)
- JSON-io (https://github.com/jdereg/json-io)
- JSON-lib (http://json-lib.sourceforge.net/)
We will use the Jackson Project (https://github.com/FasterXML/jackson). Documentation is found at https://github.com/FasterXML/jackson-docs. We will use two JSON files to demonstrate how it can be used. The first file, Person.json, is shown next and stores the data for a single person. It consists of four fields, where the last field is an array of location information.
{
    "firstname":"Smith",
    "lastname":"Peter",
    "phone":8475552222,
    "address":["100 Main Street","Corpus","Oklahoma"]
}
The code sequence that follows shows how to extract the values for each of the fields. Within the try-catch block, a JsonFactory instance is created, which then creates a JsonParser instance based on the Person.json file.
try {
    JsonFactory jsonfactory = new JsonFactory();
    JsonParser parser = jsonfactory.createParser(new File("Person.json"));
    ...
    parser.close();
} catch (IOException ex) {
    // Handle exceptions
}
The JsonParser object keeps track of the current token. In the while loop, the nextToken method advances the parser to the next token and returns it. The getCurrentName method returns the field name associated with the current token. The while loop terminates when the END_OBJECT token that closes the object is reached.
while (parser.nextToken() != JsonToken.END_OBJECT) {
    String token = parser.getCurrentName();
    ...
}
The body of the loop consists of a series of if statements that process each field by its name. Since the address field is an array, another loop will extract each of its elements until the ending array token is reached.
if ("firstname".equals(token)) {
    parser.nextToken();
    String fname = parser.getText();
    out.println("firstname : " + fname);
}
if ("lastname".equals(token)) {
    parser.nextToken();
    String lname = parser.getText();
    out.println("lastname : " + lname);
}
if ("phone".equals(token)) {
    parser.nextToken();
    long phone = parser.getLongValue();
    out.println("phone : " + phone);
}
if ("address".equals(token)) {
    out.println("address :");
    parser.nextToken();
    while (parser.nextToken() != JsonToken.END_ARRAY) {
        out.println(parser.getText());
    }
}
The output of this example follows:
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
However, JSON objects are frequently more complex than the previous example. Here, a Persons.json file contains a group name and an array of three persons:
{
    "persons": {
        "groupname": "school",
        "person": [
            {"firstname":"Smith", "lastname":"Peter", "phone":8475552222,
             "address":["100 Main Street","Corpus","Oklahoma"] },
            {"firstname":"King", "lastname":"Sarah", "phone":8475551111,
             "address":["200 Main Street","Corpus","Oklahoma"] },
            {"firstname":"Frost", "lastname":"Nathan", "phone":8475553333,
             "address":["300 Main Street","Corpus","Oklahoma"] }
        ]
    }
}
To process this file, we use a similar set of code as shown previously. We create the parser and then enter a loop as before:
try {
    JsonFactory jsonfactory = new JsonFactory();
    JsonParser parser = jsonfactory.createParser(new File("Persons.json"));
    while (parser.nextToken() != JsonToken.END_OBJECT) {
        String token = parser.getCurrentName();
        ...
    }
    parser.close();
} catch (IOException ex) {
    // Handle exceptions
}
However, we need to find the persons field and then extract each of its elements. The groupname field is extracted and displayed as shown here:
if ("persons".equals(token)) {
    // Skip the START_OBJECT token and move to the first field name
    JsonToken jsonToken = parser.nextToken();
    jsonToken = parser.nextToken();
    token = parser.getCurrentName();
    if ("groupname".equals(token)) {
        parser.nextToken();
        String groupname = parser.getText();
        out.println("Group : " + groupname);
        ...
    }
}
Next, we find the person field and call a parsePerson method to better organize the code:
parser.nextToken();
token = parser.getCurrentName();
if ("person".equals(token)) {
    out.println("Found person");
    parsePerson(parser);
}
The parsePerson method follows. It is very similar to the process used in the first example.
public void parsePerson(JsonParser parser) throws IOException {
    while (parser.nextToken() != JsonToken.END_ARRAY) {
        String token = parser.getCurrentName();
        if ("firstname".equals(token)) {
            parser.nextToken();
            String fname = parser.getText();
            out.println("firstname : " + fname);
        }
        if ("lastname".equals(token)) {
            parser.nextToken();
            String lname = parser.getText();
            out.println("lastname : " + lname);
        }
        if ("phone".equals(token)) {
            parser.nextToken();
            long phone = parser.getLongValue();
            out.println("phone : " + phone);
        }
        if ("address".equals(token)) {
            out.println("address :");
            parser.nextToken();
            while (parser.nextToken() != JsonToken.END_ARRAY) {
                out.println(parser.getText());
            }
        }
    }
}
The output follows:
Group : school
Found person
firstname : Smith
lastname : Peter
phone : 8475552222
address :
100 Main Street
Corpus
Oklahoma
firstname : King
lastname : Sarah
phone : 8475551111
address :
200 Main Street
Corpus
Oklahoma
firstname : Frost
lastname : Nathan
phone : 8475553333
address :
300 Main Street
Corpus
Oklahoma
Using the JSON tree API
The second approach is to use the tree model. An ObjectMapper instance is used to create a JsonNode instance using the Persons.json file. The fieldNames method returns an Iterator allowing us to process each element of the file.
try {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode node = mapper.readTree(new File("Persons.json"));
    Iterator<String> fieldNames = node.fieldNames();
    while (fieldNames.hasNext()) {
        ...
        fieldNames.next();
    }
} catch (IOException ex) {
    // Handle exceptions
}
Since the JSON file contains a persons field, we will obtain a JsonNode instance representing the field and then iterate over each of its elements.
JsonNode personsNode = node.get("persons");
Iterator<JsonNode> elements = personsNode.iterator();
while (elements.hasNext()) {
    ...
}
Each element is processed one at a time. If the element type is a string, we assume that this is the groupname field.
JsonNode element = elements.next();
JsonNodeType nodeType = element.getNodeType();
if (nodeType == JsonNodeType.STRING) {
    out.println("Group: " + element.textValue());
}
If the element is an array, we assume it contains a series of persons where each person is processed by the parsePerson method:
if (nodeType == JsonNodeType.ARRAY) {
    Iterator<JsonNode> fields = element.iterator();
    while (fields.hasNext()) {
        parsePerson(fields.next());
    }
}
The parsePerson method is shown next:
public void parsePerson(JsonNode node) {
    Iterator<JsonNode> fields = node.iterator();
    while (fields.hasNext()) {
        JsonNode subNode = fields.next();
        out.println(subNode.asText());
    }
}
The output follows:
Group: school Smith Peter 8475552222 King Sarah 8475551111 Frost Nathan 8475553333
There is much more to JSON than we are able to illustrate here. However, this should give you an idea of how this type of data can be handled.
The nitty gritty of cleaning text
Strings are used to support text processing, so using a good string library is important. Unfortunately, the java.lang.String class has some limitations. To address these limitations, you can either implement your own special string functions as needed or you can use a third-party library.
Creating your own library can be useful, but you will basically be reinventing the wheel. It may be faster to write a simple code sequence to implement some functionality, but to do things right, you will need to test them. Third-party libraries have already been tested and have been used on hundreds of projects. They provide a more efficient way of processing text.
There are several text processing APIs in addition to those found in Java. We will demonstrate two of these:
- Apache Commons: https://commons.apache.org/
- Guava: https://github.com/google/guava
Java provides considerable support for cleaning text data, including methods in the String class. These methods are ideal for simple text cleaning and small amounts of data but can also be efficient with larger, complex datasets. We will demonstrate several String class methods in a moment. Some of the most helpful String class methods are summarized in the following table:
Method Name | Return Type | Description
trim | String | Removes leading and trailing blank spaces
toUpperCase / toLowerCase | String | Changes the casing of the entire string
replaceAll | String | Replaces all occurrences of a character sequence within the string
contains | boolean | Determines whether a given character sequence exists within the string
compareTo / compareToIgnoreCase | int | Compares two strings lexicographically and returns an integer representing their relationship
matches | boolean | Determines whether the string matches a given regular expression
join | String | Combines two or more strings with a specified delimiter
split | String[] | Separates elements of a given string using a specified delimiter
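The String class methods described in this table can be exercised with a short sketch such as the following. The class name, the normalize helper, and the sample values are hypothetical. Locale.ROOT is passed to toLowerCase as a precaution so that the result does not depend on the default locale.

```java
import java.util.Locale;

class StringMethodsSketch {
    // Trims, lowercases, and collapses runs of whitespace into single spaces.
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String clean = normalize("  Call  me   ISHMAEL ");
        System.out.println(clean);                            // trim, toLowerCase, replaceAll
        System.out.println(clean.contains("ishmael"));        // contains
        System.out.println("apple".compareTo("banana") < 0);  // compareTo
        System.out.println("12345".matches("\\d{5}"));        // matches
        String[] parts = "a,b,c".split(",");                  // split
        System.out.println(String.join("-", parts));          // join
    }
}
```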
Many text operations are simplified by the use of regular expressions. Regular expressions use standardized syntax to represent patterns in text, which can be used to locate and manipulate text matching the pattern.
A regular expression is itself simply a string. For example, the string Hello, my name is Sally can be used as a regular expression to find those exact words within a given text. This is very specific and not broadly applicable, but we can use a different regular expression to make our code more effective. Hello, my name is \\w will match any text that starts with Hello, my name is and ends with a word character.
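As a minimal sketch of this pattern in use, the following relies on the java.util.regex classes. The class and method names are hypothetical. The capturing group (\\w+) in the second method is an extension of the pattern in the text, added here to show how the matched name could be pulled out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexIntroSketch {
    // True when the greeting is followed by at least one word character.
    static boolean containsGreeting(String text) {
        return Pattern.compile("Hello, my name is \\w").matcher(text).find();
    }

    // Returns the run of word characters after the greeting, or null if absent.
    static String nameAfterGreeting(String text) {
        Matcher matcher = Pattern.compile("Hello, my name is (\\w+)").matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(containsGreeting("Hello, my name is Sally"));
        System.out.println(containsGreeting("Hello, my name is !"));
        System.out.println(nameAfterGreeting("Oh. Hello, my name is Bob!"));
    }
}
```

The find method is used rather than String.matches because matches succeeds only when the whole string fits the pattern, and \\w on its own stands for a single character.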
We will use several examples of more complex regular expressions, and some of the more useful syntax options are summarized in the following table. Note that within a Java string literal the backslash itself must be escaped, so \d is written as \\d.
Option | Description
\d | Any digit: 0-9
\D | Any non-digit
\s | Any whitespace character
\S | Any non-whitespace character
\w | Any word character (including digits): A-Z, a-z, 0-9, and the underscore
\W | Any non-word character
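A brief sketch of three of these character classes used with replaceAll follows. The class name, method names, and sample string are made up for illustration.

```java
class RegexClassesSketch {
    // Replaces every digit with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    // Collapses each run of whitespace (spaces, tabs) into one space.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    // Removes every character that is not a word character.
    static String keepWordChars(String s) {
        return s.replaceAll("\\W", "");
    }

    public static void main(String[] args) {
        String sample = "Room 101,\tfloor 3";
        System.out.println(maskDigits(sample));
        System.out.println(collapseWhitespace(sample));
        System.out.println(keepWordChars(sample));
    }
}
```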
The size and source of text data vary widely from application to application, but the methods used to transform the data remain the same. You may actually need to read data from a file, but for simplicity's sake, we will be using a string containing the beginning sentences of Herman Melville's Moby Dick for several examples within this chapter. Unless otherwise specified, the text will be assumed to be as shown next:
String dirtyText = "Call me Ishmael. Some years ago- never mind how";
dirtyText += " long precisely - having little or no money in my purse,";
dirtyText += " and nothing particular to interest me on shore, I thought";
dirtyText += " I would sail about a little and see the watery part of the world.";
Using Java tokenizers to extract words
Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.
Java core tokenizers
StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development as the String class's split method is considered more efficient. While it does provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:
StringTokenizer tokenizer = new StringTokenizer(dirtyText, " ");
while (tokenizer.hasMoreTokens()) {
    out.print(tokenizer.nextToken() + " ");
}
When we set the dirtyText variable to hold our text from Moby Dick, shown previously, we get the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely...
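One practical difference between StringTokenizer and the split method is how they treat consecutive delimiters. The following sketch, which uses a hypothetical class name and a made-up comma-delimited sample, shows that StringTokenizer skips the empty token between two adjacent delimiters while split keeps it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVsSplitSketch {
    // Collects the tokens produced by StringTokenizer into a list.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Call,,Ishmael";  // two adjacent commas
        System.out.println(tokenize(text, ","));               // empty token skipped
        System.out.println(Arrays.toString(text.split(",")));  // empty token kept
    }
}
```

Which behavior is preferable depends on the data. For column-oriented data such as CSV, dropping an empty field shifts every later column.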
StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings, and you can specify only one delimiter pattern (a regular expression) for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
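To illustrate the extra token information that StreamTokenizer provides, the following sketch labels each token as a word, a number, or a single character. The class name, method name, and sample text are hypothetical, and the tokenizer is left at its default settings.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSketch {
    // Labels each token using the ttype, sval, and nval fields of StreamTokenizer.
    static List<String> describeTokens(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        List<String> described = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    described.add("WORD:" + tokenizer.sval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    described.add("NUMBER:" + tokenizer.nval);
                } else {
                    described.add("CHAR:" + (char) tokenizer.ttype);
                }
            }
        } catch (IOException ex) {
            // Not expected when reading from an in-memory string
            throw new UncheckedIOException(ex);
        }
        return described;
    }

    public static void main(String[] args) {
        System.out.println(describeTokens("waited 3 years"));
    }
}
```

Numbers are reported as double values through the nval field, which is why the 3 in the sample is shown as 3.0.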
The Scanner class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
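As a small sketch of parsing tokens into data types with Scanner, the following adds up the integers found in a string and skips every other token. The class name, method name, and sample sentence are made up for illustration.

```java
import java.util.Scanner;

class ScannerTypesSketch {
    // Sums the whitespace-delimited tokens that parse as integers.
    static int sumIntegers(String text) {
        int total = 0;
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                if (scanner.hasNextInt()) {
                    total += scanner.nextInt();
                } else {
                    scanner.next();  // discard a non-integer token
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumIntegers("3 apples and 4 pears"));
    }
}
```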
Third-party tokenizers and libraries
Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:
StrTokenizer tokenizer = new StrTokenizer(text);
while (tokenizer.hasNext()) {
    out.print(tokenizer.next() + " ");
}
This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.
When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...
We can modify our constructor as follows:
StrTokenizer tokenizer = new StrTokenizer(text,",");
The output for this implementation is:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse
and nothing particular to interest me on shore
I thought I would sail about a little and see the watery part of the world.
Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher object.
Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:
Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults();
Iterable<String> words = simpleSplit.split(dirtyText);
for (String token : words) {
    out.print(token);
}
This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining, which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.
Transforming data into a usable form
Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyze.
Simple text cleaning
We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces each pair of back-to-back spaces with a single space, and is repeated until no double spaces remain:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
while (dirtyText.contains("  ")) {
    dirtyText = dirtyText.replaceAll("  ", " ");
}
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely -
call me ishmael some years ago never mind how long precisely
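The same cleaning steps can be wrapped in a reusable method, sketched below under a few assumptions. The class and method names are hypothetical, Locale.ROOT is passed to toLowerCase so the result does not depend on the default locale, and the loop over double spaces is swapped for a single replaceAll with the \\s+ pattern, which collapses any run of whitespace in one pass.

```java
import java.util.Locale;

class TextCleanSketch {
    // Lowercases the text, replaces digits and punctuation with spaces,
    // collapses runs of whitespace, and trims the ends.
    static String clean(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[\\d[^\\w\\s]]+", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        String dirtyText = "Call me Ishmael. Some years ago- never mind how long precisely - ";
        System.out.println(clean(dirtyText));
    }
}
```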
Our next example produces the same result but approaches the problem by splitting the text into individual words. In this case, we remove all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:
out.println(dirtyText);
dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", "");
String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+");
for (String clean : cleanText) {
    out.print(clean + " ");
}
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:
out.println(dirtyText);
String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
String cleanText = String.join(" ", words);
out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join
method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner
class:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = Joiner.on(" ").skipNulls().join(words); out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
// Split the cleaned text on spaces so that each word becomes one list element
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    // Normalize each stop word the same way as the text
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world] Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
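Calling contains on an ArrayList scans the whole list for every stop word. One possible alternative, sketched here under the assumption that the stop words fit comfortably in memory, keeps them in a HashSet and filters the text with a stream. The class name StopWordFilter, the method removeStopWords, and the short stop word list are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordFilter {
    // Keeps only the words that are not in the stop word set.
    // Both inputs are assumed to be lowercase and trimmed already.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
                .filter(word -> !stopWords.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative stop words only; a real list would be read from a file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "or", "no"));
        List<String> words = Arrays.asList("call", "me", "ishmael", "some", "years", "ago");
        System.out.println("Text without stop words: " + removeStopWords(words, stopWords));
    }
}
```

This version leaves the original list untouched and returns a new one, which can be convenient when the unfiltered text is still needed later.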
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain an instance of the TokenizerFactory class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to tokenize:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences from our previous example. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
Finding words in text
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
We also have the option to use the Scanner
class to search through an entire file. One helpful method is the findWithinHorizon
method. This uses a Scanner
to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner
will be searched by default:
dirtyText = dirtyText.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
Scanner textLine = new Scanner(dirtyText);
// A horizon of 0 places no limit on how far ahead the Scanner searches
out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
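When the positions and the number of matches do matter, the Pattern and Matcher classes can report both. The sketch below is one way to do this and is not taken from the chapter: WordLocator and findPositions are hypothetical names, and it assumes we only want whole-word matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordLocator {
    // Returns the start index of every whole-word occurrence of toFind in text.
    static List<Integer> findPositions(String text, String toFind) {
        // Pattern.quote treats toFind as literal text; \\b anchors on word boundaries.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(toFind) + "\\b");
        Matcher matcher = pattern.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Integer> positions = findPositions("i thought i would sail", "i");
        System.out.println("Found " + positions.size() + " times at " + positions);
    }
}
```

The size of the returned list gives the count, and each entry gives a character offset into the original string.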
It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file line by line until readLine returns null, which signals the end of the file:
String path = "C://MobyDick.txt";
try {
    String textLine = "";
    int line = 0; // number of lines read so far
    toFind = toFind.toLowerCase().trim();
    BufferedReader textToClean = new BufferedReader(new FileReader(path));
    while((textLine = textToClean.readLine()) != null){
        line++;
        if(textLine.toLowerCase().trim().contains(toFind)){
            out.println("Found " + toFind + " in " + textLine);
        }
    }
    textToClean.close();
} catch (IOException ex) {
    // Handle exceptions
}
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains
method. If we find the text, we call the replaceAll
method to modify our string:
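text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if(text.contains(toFind)){
    // \\b limits the match to whole words, so the i inside ishmael or mind is left alone
    text = text.replaceAll("\\b" + toFind + "\\b", replaceWith);
    out.println(text);
}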
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world. call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely - Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
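Another option that stays within core Java is to anchor the search on word boundaries with a regular expression. This is a sketch under the assumption that a "word" is whatever the \\b boundary of java.util.regex considers one; WholeWordReplacer and replaceWord are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordReplacer {
    // Replaces word only where it stands alone, not where it is part of a longer word.
    static String replaceWord(String text, String word, String replacement) {
        // quote and quoteReplacement make both arguments behave as literal text.
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("Call me Ishmael. Some say it suits me.", "me", "X"));
    }
}
```

Here the me inside Some is left alone, while the me directly before the final period is still replaced, which the space-padding approach would miss.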
The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as the period at the end of a sentence or a trailing hyphen, with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.
In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:
text = text.replace("me", " "); out.println("With double spaces: " + text); String spaced = CharMatcher.WHITESPACE.trimAndCollapseFrom(text, ' '); out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call Ishmael. So years ago- ... With double spaces removed: Call Ishmael. So years ago- ...
Data imputation
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is being conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, it lowers the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
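The averaging approach just described might be sketched as follows. This is an illustration only: it assumes missing values are marked with Double.NaN rather than zero, so that a genuine zero reading is never mistaken for a gap, and the names MeanImputer and imputeWithMean are hypothetical.

```java
import java.util.Arrays;

final class MeanImputer {
    // Returns a copy of values in which every NaN entry is replaced by the
    // mean of the entries that are present.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // nothing to average if every value is missing
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Same temperatures as above, with the first month marked as missing.
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("Imputed value for the first month: %.2f%n", filled[0]);
    }
}
```

Substituting the mean leaves the overall average unchanged, although it does reduce the apparent variation in the data, which is one reason the choice of strategy depends on the analysis being performed.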
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName, which we use to handle array values that may be null. We then loop through our array and call the Optional class's ofNullable method. This method wraps the value in an Optional, or returns an empty Optional if the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
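As one illustration of those other methods, map and filter can be chained before orElse so that blank strings are treated the same way as nulls. This is a sketch with hypothetical names (NameCleaner, cleanName), and the decision to treat whitespace-only names as missing is an assumption.

```java
import java.util.Optional;

final class NameCleaner {
    // Returns the trimmed name, or DEFAULT when the name is null or blank.
    static String cleanName(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)          // skipped automatically when name is null
                .filter(s -> !s.isEmpty())  // an empty result becomes an empty Optional
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", " Bob ", null, "   "};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name));
        }
    }
}
```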
Subsetting data
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44};
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums)));
SortedSet<Integer> partNumsList;
out.println("Original List: " + fullNumsList.toString());
The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is excluded. In our example that follows, we want to retrieve all numbers from the first number in our set, 12, up to but not including 46:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46);
out.println("SubSet of List: " + partNumsList.toString());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] SubSet of List: [12, 14, 34, 44]
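If the upper bound should be part of the result, TreeSet also offers a four-argument subSet from the NavigableSet interface, along with headSet and tailSet for open-ended ranges. The sketch below reuses the same numbers; the class name SubsetBounds and the method inclusiveRange are hypothetical.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class SubsetBounds {
    // Returns the elements from low to high, including both end points.
    static NavigableSet<Integer> inclusiveRange(TreeSet<Integer> set, int low, int high) {
        return set.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        TreeSet<Integer> nums = new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        System.out.println("Inclusive range: " + inclusiveRange(nums, 12, 46));
        // headSet excludes its argument; tailSet includes it.
        System.out.println("Below 46: " + nums.headSet(46));
        System.out.println("From 46 up: " + nums.tailSet(46));
    }
}
```

All three methods return views backed by the original set, so removing an element from the view also removes it from the full set.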
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a List named numsList, and we will specify how many elements to skip with the skip method. We will also use the collect method, together with toCollection (statically imported from the Collectors class), to create a new Set to hold the remaining elements:
out.println("Original List: " + numsList.toString()); Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); Set<Integer> partNumsList = fullNumsList .stream() .skip(5) .collect(toCollection(TreeSet::new)); out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) {
    br.lines()
        .filter(s -> !s.equals(""))
        .forEach(s -> out.println(s));
} catch (IOException ex) {
    // Handle exceptions
}
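Header lines can be dropped in the same pipeline by adding skip before the filter. The following sketch reads from a String through a StringReader so that it is self-contained; with a real file, a FileReader in a try-with-resources block would take its place. The names LineCleaner and dropHeaderAndBlanks are hypothetical, and treating whitespace-only lines as blank is an assumption.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

final class LineCleaner {
    // Skips the first headerLines lines, then keeps only lines that contain
    // at least one non-whitespace character.
    static List<String> dropHeaderAndBlanks(String text, int headerLines) {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        return reader.lines()
                .skip(headerLines)
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "name,age\n\nZoey,8\n   \nRoxie,10";
        dropHeaderAndBlanks(data, 1).forEach(System.out::println);
    }
}
```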
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
We can now pass compareInts to the sort method of the Collections class, using the same numsList of integers as in the earlier examples:
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with wordsList, a list of strings. Notice the use of the compareTo method rather than compare:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
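The Dogs class itself is only assumed in the examples above. A minimal stand-in, together with one further variation that sorts the oldest dogs first, might look like the sketch below. The field layout, the DogSortDemo class, and the oldestFirst method are hypothetical and not taken from the chapter.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the Dogs class assumed in the text.
final class Dogs {
    private final String name;
    private final int age;

    Dogs(String name, int age) {
        this.name = name;
        this.age = age;
    }

    String getName() { return name; }
    int getAge() { return age; }
}

final class DogSortDemo {
    // Oldest dogs first; dogs of the same age are ordered by name.
    static List<Dogs> oldestFirst(List<Dogs> dogs) {
        Comparator<Dogs> byAge = Comparator.comparingInt(Dogs::getAge);
        List<Dogs> sorted = new ArrayList<>(dogs); // leave the caller's list untouched
        sorted.sort(byAge.reversed().thenComparing(Dogs::getName));
        return sorted;
    }

    public static void main(String[] args) {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        for (Dogs d : oldestFirst(dogs)) {
            System.out.println(d.getName() + " " + d.getAge());
        }
    }
}
```

Calling reversed on the age comparator before thenComparing reverses only the age ordering; the names within each age group still appear in ascending order.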
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        // parseInt rejects any text that is not a valid int
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
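As an illustration of adapting the same technique to another type, the following sketch checks for a valid double and returns a boolean instead of printing. The names NumberChecks and isValidDouble are hypothetical, and rejecting null up front is a design choice made here, since parseDouble throws a NullPointerException rather than a NumberFormatException for null input.

```java
final class NumberChecks {
    // Returns true when the text can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw a NullPointerException here
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("3.14 valid? " + isValidDouble("3.14"));
        System.out.println("Ishmael valid? " + isValidDouble("Ishmael"));
    }
}
```

Note that parseDouble also accepts forms such as 1e3 and NaN, so a stricter check may be needed if only plain decimal numbers are acceptable.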
Apache Commons Validator contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){
    try {
        SimpleDateFormat format = new SimpleDateFormat(dateFormat);
        Date test = format.parse(theDate);
        if(format.format(test).equals(theDate)){
            return theDate.toString() + " is a valid date";
        }else{
            return theDate.toString() + " is not a valid date";
        }
    } catch (ParseException e) {
        return theDate.toString() + " is not a valid date";
    }
}
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
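The java.time package introduced in Java 8 offers another way to perform this check. The sketch below is an alternative to the SimpleDateFormat approach, not the chapter's own method: StrictDateCheck and isValidDate are hypothetical names, and the pattern uses uuuu for the year because strict resolution with yyyy would also require an era.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

final class StrictDateCheck {
    // Returns true when theDate matches dateFormat and names a real calendar date.
    static boolean isValidDate(String theDate, String dateFormat) {
        DateTimeFormatter formatter = DateTimeFormatter
                .ofPattern(dateFormat)
                .withResolverStyle(ResolverStyle.STRICT); // rejects dates such as February 30
        try {
            LocalDate.parse(theDate, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        System.out.println(isValidDate("12/12/1982", dateFormat));
        System.out.println(isValidDate("12/12/82", dateFormat));
        System.out.println(isValidDate("02/30/1982", dateFormat));
    }
}
```

With strict resolution there is no need to format the parsed date back to a string and compare, because impossible dates and short years are rejected during parsing.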
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    // Compile the expression once and keep the resulting Pattern
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
The JavaMail API, which is distributed separately from the core Java libraries, can also validate e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate
method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator
class. This class's isValid
method examines an e-mail address and determines whether it is valid or not. Our validateEmail
method shown previously is modified as follows to use EmailValidator
:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
validateZip("12345"); validateZip("12345-6789"); validateZip("123");
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
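When the result is needed by other code rather than printed, the same expression can be wrapped in a method that returns a boolean. The following is a minimal sketch under that assumption. The class name ZipCheck and the method name isValidZip are hypothetical, and the pattern is compiled once and reused instead of being recompiled on every call:

```java
import java.util.regex.Pattern;

final class ZipCheck {
    // Same expression as validateZip: five digits, optionally followed by
    // a hyphen and four more digits. Compiled once and reused.
    private static final Pattern ZIP =
        Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$");

    // Hypothetical helper: true when the whole string is a well-formed US ZIP code.
    static boolean isValidZip(String zip) {
        return zip != null && ZIP.matcher(zip).matches();
    }

    public static void main(String[] args) {
        for (String zip : new String[] {"12345", "12345-6789", "123"}) {
            System.out.println(zip + " -> " + isValidZip(zip));
        }
    }
}
```

Returning a boolean keeps the decision about what to do with invalid data (log it, drop it, or repair it) in the calling code.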
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s-',. in the character class to allow spaces, hyphens, apostrophes, commas, and periods in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    String nameRegex = "^[\\p{L}\\s-',.]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr.
are acceptable, but the 4
in 4th
is not. Additionally, the special characters in François
and Müller
are considered valid.
Using Java tokenizers to extract words
Often it is most efficient to analyze text data as tokens. There are multiple tokenizers available in the core Java libraries as well as third-party tokenizers. We will demonstrate various tokenizers throughout this chapter. The ideal tokenizer will depend upon the limitations and requirements of an individual application.
Java core tokenizers
StringTokenizer was the first and most basic tokenizer and has been available since Java 1. It is a legacy class that is not recommended for new development; the String class's split method or the java.util.regex package is generally preferred. While StringTokenizer can provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer class that splits a string on spaces:
StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); while(tokenizer.hasMoreTokens()){ out.print(tokenizer.nextToken() + " "); }
When we set the dirtyText
variable to hold our text from Moby Dick, shown previously, we get the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely...
StreamTokenizer is another core Java tokenizer. StreamTokenizer grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer or the split method. The String class split method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings, and you can only specify a single delimiter pattern (a regular expression) for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
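To make the extra token information concrete, the following is a minimal sketch of StreamTokenizer reading from a string. The class name StreamTokenizerSketch, the method name tokenize, and the sample sentence are assumptions made for illustration. The sketch relies on the tokenizer's default settings, which parse numbers as double values and treat quote and slash characters specially, so real text may need additional configuration:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class StreamTokenizerSketch {
    // Hypothetical helper: labels each token as a word or a number.
    static List<String> tokenize(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("word:" + st.sval);    // text of the word
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("number:" + st.nval);  // numeric value as a double
                }
                // Any other token type (such as single punctuation characters) is skipped
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Call me Ishmael 42 times"));
    }
}
```

The ttype field is what distinguishes this class from the simpler options: each token arrives with its type, so words and numbers can be handled differently without a second parsing step.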
The Scanner
class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
Third-party tokenizers and libraries
Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer
. This class provides more advanced support than the standard StringTokenizer
class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer
:
StrTokenizer tokenizer = new StrTokenizer(text); while (tokenizer.hasNext()) { out.print(tokenizer.next() + " "); }
This operates in a similar fashion to StringTokenizer
and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.
When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...
We can modify our constructor as follows:
StrTokenizer tokenizer = new StrTokenizer(text,",");
The output for this implementation is:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse and nothing particular to interest me on shore I thought I would sail about a little and see the watery part of the world.
Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher
object.
Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner
class and the Splitter
class. Tokenization is accomplished in Guava using its Splitter
class's split
method. The following is a simple example:
Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); Iterable<String> words = simpleSplit.split(dirtyText); for(String token: words){ out.print(token); }
This splits each token on commas and produces output like our last example. We can modify the parameter of the on
method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Removing stop words section.
Transforming data into a usable form
Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyze.
Simple text cleaning
We will use the string shown before from Moby Dick to demonstrate some of the basic String
class methods. Notice the use of the toLowerCase
and trim
methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll
method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
// Both string literals in the loop test and the search contain two spaces
while(dirtyText.contains("  ")){
    dirtyText = dirtyText.replaceAll("  ", " ");
}
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely - call me ishmael some years ago never mind how long precisely
Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String
array. The split
method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W
, which represents anything that is not a word character:
out.println(dirtyText); dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); for(String clean : cleanText){ out.print(clean + " "); }
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join
method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join
method joins every word in the array words
and inserts a space between each word:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = String.join(" ", words); out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join
method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner
class:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = Joiner.on(" ").skipNulls().join(words); out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
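The split-and-join steps can also be combined into a single stream pipeline. The following is a minimal sketch, assuming Java 8 or later; the class name CleanTextSketch and the method name clean are hypothetical. It also filters out the empty first token that split returns when the text begins with a delimiter, which would otherwise leave a leading space in the joined result:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class CleanTextSketch {
    // Hypothetical helper: lowercases the text, splits on runs of non-word
    // characters and digits, and rejoins the words with single spaces.
    static String clean(String dirtyText) {
        return Arrays.stream(dirtyText.toLowerCase().split("[\\W\\d]+"))
            .filter(word -> !word.isEmpty())   // split can yield an empty first token
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(clean("  Call me Ishmael. Some years ago- never mind how long"));
    }
}
```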
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform, split it into words, and store the words in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt")); ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" "))); out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world] Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
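When the stop words are already in memory, the same filtering can be sketched with a Set and a stream. This is an alternative to the file-based approach above, not a replacement for it. The class name StopWordSketch, the method name removeStopWords, and the short stop word list are assumptions made for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordSketch {
    // Hypothetical helper: keeps only the words that are not in the stop word set.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
            .filter(word -> !stopWords.contains(word))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative stop words only; a real list would be much longer
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "the"));
        List<String> words = Arrays.asList("call me ishmael some years ago".split(" "));
        System.out.println(removeStopWords(words, stopWords));
    }
}
```

A HashSet lookup does not scan the whole collection the way ArrayList's contains method does, which can matter when both the text and the stop word list are large.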
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain an instance of the TokenizerFactory interface. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the text, and the last parameter specifies how many characters to tokenize:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences between this output and our previous example. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
Finding words in text
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
We also have the option to use the Scanner
class to search through an entire file. One helpful method is the findWithinHorizon
method. This uses a Scanner
to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner
will be searched by default:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); Scanner textLine = new Scanner(dirtyText); out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
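When the positions and the number of matches are needed, one option is to use the Pattern and Matcher classes directly. The following is a minimal sketch; the class name WordFinderSketch and the method name findWord are hypothetical. It matches whole words only by surrounding the search term with \\b word boundaries, and it ignores case so the text does not need to be lowercased first:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordFinderSketch {
    // Hypothetical helper: returns the start index of every whole-word,
    // case-insensitive occurrence of toFind within text.
    static List<Integer> findWord(String text, String toFind) {
        Pattern pattern = Pattern.compile(
            "\\b" + Pattern.quote(toFind) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Integer> found = findWord("I thought I would sail", "i");
        System.out.println("Found " + found.size() + " times at " + found);
    }
}
```

The size of the returned list gives the count, and each entry gives a location, so both questions are answered in a single pass over the text.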
It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-with-resources block to close the reader and catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not null:
String path = "C://MobyDick.txt";
toFind = toFind.toLowerCase().trim();
int line = 0; // number of lines read so far
try (BufferedReader textToClean = new BufferedReader(new FileReader(path))) {
    String textLine;
    while((textLine = textToClean.readLine()) != null){
        line++;
        if(textLine.toLowerCase().trim().contains(toFind)){
            out.println("Found " + toFind + " in " + textLine);
        }
    }
} catch (IOException ex) {
    // Handle exceptions
}
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression, we surround the search text with \\b word boundaries so that only whole words are replaced:
text = text.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); out.println(text); if(text.contains(toFind)){ text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); out.println(text); }
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world. call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely - Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
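The partial-word problem described here can also be handled with the standard library by matching on word boundaries. The following is a minimal sketch; the class name WholeWordReplaceSketch and the method name replaceWord are hypothetical. The \\b boundary matches at the edge of a word, so me is replaced when it stands alone, including before a period, but not inside Some:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordReplaceSketch {
    // Hypothetical helper: replaces whole-word occurrences of word only.
    static String replaceWord(String text, String word, String replacement) {
        // Pattern.quote and Matcher.quoteReplacement keep special characters
        // in the arguments from being interpreted as regex syntax
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("Call me Ishmael. Some years ago, tell me.", "me", "X"));
    }
}
```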
The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as a period, hyphen, or comma before a space, with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.
In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
// CharMatcher.whitespace() replaces the CharMatcher.WHITESPACE constant of older Guava releases
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call Ishmael. So years ago- ... With double spaces removed: Call Ishmael. So years ago- ...
Data imputation
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analyzed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what the speaker is trying to convey.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, it lowers the calculated average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
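The averaging approach just described can be sketched as follows. This is a minimal illustration, assuming that a single marker value (zero here) stands for missing data; the class name MeanImputationSketch and the method name imputeMean are hypothetical:

```java
import java.util.Arrays;

final class MeanImputationSketch {
    // Hypothetical helper: replaces every entry equal to missingMarker
    // with the mean of the remaining entries.
    static double[] imputeMean(double[] values, double missingMarker) {
        double mean = Arrays.stream(values)
            .filter(v -> v != missingMarker)
            .average()
            .orElse(missingMarker);   // nothing to average, so leave values as they are
        return Arrays.stream(values)
            .map(v -> v == missingMarker ? mean : v)
            .toArray();
    }

    public static void main(String[] args) {
        // Same temperatures as above, with the first value treated as missing
        double[] tempList = {0, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeMean(tempList, 0);
        System.out.printf("Imputed value: %.2f%n", filled[0]);
    }
}
```

With the first temperature treated as missing, the substituted value is the mean of the other eleven months, 794/11 or about 72.18, and the overall average of the filled array is that same value. As noted above, using zero as the marker is only safe when zero cannot be a legitimate measurement.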
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName to hold the name we will actually print out. We also declared a variable of the Optional class called tempName. We will use this to handle values in the array that may be null. We then loop through our array and call the Optional class ofNullable method. This method wraps the value in an Optional, returning an empty Optional if the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
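As one illustration of those other methods, map and filter can be chained before orElse so that blank strings are handled the same way as nulls. The following is a minimal sketch; the class name OptionalSketch and the method name nameOrDefault are hypothetical:

```java
import java.util.Optional;

final class OptionalSketch {
    // Hypothetical helper: returns the trimmed name, or DEFAULT when the
    // name is null, empty, or contains only whitespace.
    static String nameOrDefault(String name) {
        return Optional.ofNullable(name)
            .map(String::trim)              // skipped when the Optional is empty
            .filter(s -> !s.isEmpty())      // turns a blank name into an empty Optional
            .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Amy", "  Bob ", null, "   "}) {
            System.out.println("Name to use = " + nameOrDefault(name));
        }
    }
}
```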
Subsetting data
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then create a new TreeSet
object to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums))); SortedSet<Integer> partNumsList; out.println("Original List: " + fullNumsList.toString() + " " + fullNumsList.last());
The subSet
method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12
and 46
:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); out.println("SubSet of List: " + partNumsList.toString() + " " + partNumsList.size());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
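Because TreeSet also implements the NavigableSet interface, the bounds of a subset can be made inclusive or exclusive explicitly. The following is a minimal sketch using the same numbers; the class name SubsetSketch and the method name sample are hypothetical:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class SubsetSketch {
    // Hypothetical helper: the same numbers as above, stored in sorted order.
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums = sample();
        // Both bounds inclusive, so 46 is part of the result
        System.out.println(nums.subSet(12, true, 46, true));
        // Everything strictly below 34
        System.out.println(nums.headSet(34));
        // Everything from 52 upward, including 52
        System.out.println(nums.tailSet(52, true));
    }
}
```

The headSet and tailSet methods are convenient when only one bound matters, such as discarding all values below a threshold.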
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a List named numsList, and specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements:
out.println("Original List: " + numsList.toString());
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
Set<Integer> partNumsList = fullNumsList
    .stream()
    .skip(5)
    .collect(Collectors.toCollection(TreeSet::new));
out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { br .lines() .filter(s -> !s.equals("")) .forEach(s -> out.println(s)); } catch (IOException ex) { // Handle exceptions }
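The same pipeline can be extended to drop a header line and lines that contain only whitespace. The following is a minimal sketch, assuming the first non-blank line is the header; the class name SkipHeaderSketch, the method name dataLines, and the sample data are hypothetical, and a StringReader stands in for a file so the example is self-contained:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

final class SkipHeaderSketch {
    // Hypothetical helper: returns the data lines, without blank lines
    // and without the header.
    static List<String> dataLines(BufferedReader reader) {
        return reader.lines()
            .filter(s -> !s.trim().isEmpty())   // drop empty and whitespace-only lines
            .skip(1)                            // drop the first remaining line (the header)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        String sample = "\nid,name\n\n1,Amy\n   \n2,Bob\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            dataLines(br).forEach(System.out::println);
        }
    }
}
```

Filtering before skipping matters here: if the file begins with blank lines, calling skip first would remove a blank line and leave the header in the data.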
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
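We can now pass this comparator to the sort method of the Collections class. Here, numsList is a list of integers, initialized as shown in the next example: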
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. This can be used wherever a lambda expression is used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs
class. We first call comparing
followed by thenComparing
to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs
objects sorted first by Name
and then by Age
:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
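Comparators built this way can also be reversed per key. The following is a minimal, self-contained sketch that sorts by age from oldest to youngest and breaks ties by name. The Pet class and the namesOldestFirst method are hypothetical stand-ins introduced only for this illustration; they are not part of the Dogs class assumed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ComparatorChainSketch {
    // Hypothetical stand-in for the Dogs class assumed in the text.
    static final class Pet {
        private final String name;
        private final int age;
        Pet(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    // Returns the names ordered by age (oldest first), then alphabetically.
    static List<String> namesOldestFirst(List<Pet> pets) {
        List<Pet> sorted = new ArrayList<>(pets); // leave the input untouched
        sorted.sort(Comparator.comparingInt(Pet::getAge).reversed()
                .thenComparing(Pet::getName));
        List<String> names = new ArrayList<>();
        for (Pet p : sorted) {
            names.add(p.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<Pet> pets = Arrays.asList(
                new Pet("Zoey", 8), new Pet("Roxie", 10), new Pet("Kylie", 7),
                new Pet("Shorty", 14), new Pet("Ginger", 7), new Pet("Penny", 7));
        // Prints [Shorty, Roxie, Zoey, Ginger, Kylie, Penny]
        System.out.println(namesOldestFirst(pets));
    }
}
```

Note that reversed applies to everything chained before it, so it is called on the age comparator before thenComparing adds the name comparison.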
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt
method. This technique is easily modified for the other major data types supported in the standard Java library, including Float
and Double
.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        // parseInt throws NumberFormatException for non-integer text
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
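As a sketch of how the same try-catch technique might be adapted to floating point data, the following example returns a boolean instead of printing. The class and method names (DoubleValidationSketch, isValidDouble) are our own and are used only for illustration.

```java
class DoubleValidationSketch {
    // Returns true when the text can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw NullPointerException
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.5"));    // true
        System.out.println(isValidDouble("Ishmael")); // false
    }
}
```

Be aware that parseDouble is fairly permissive: it also accepts forms such as 1e3 and NaN, which may or may not be acceptable for a given dataset.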
The Apache Commons Validator library contains an IntegerValidator
class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator
methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator
class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number
objects to Integer
objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
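The same check can also be sketched with the java.time API introduced in Java 8. This is an alternative to the SimpleDateFormat approach above, not the method used in this chapter. With ResolverStyle.STRICT the parser rejects impossible dates such as February 30 without the extra format-and-compare step. The class and method names are our own, and note that strict mode expects the pattern letter u for the year (for example MM/dd/uuuu) rather than y.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateSketch {
    // Returns true when theDate matches dateFormat and is a real calendar date.
    static boolean isValidDate(String theDate, String dateFormat) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(dateFormat)
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(theDate, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        System.out.println(isValidDate("12/12/1982", dateFormat)); // true
        System.out.println(isValidDate("12/12/82", dateFormat));   // false
        System.out.println(isValidDate("02/30/1982", dateFormat)); // false
    }
}
```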
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    // Compile the expression, then match the whole input against it
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
The JavaMail API (the javax.mail packages, distributed separately from the core JDK) can validate e-mail addresses as well. In this example, we use its InternetAddress
class to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate
method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator
class. This class's isValid
method examines an e-mail address and determines whether it is valid or not. Our validateEmail
method shown previously is modified as follows to use EmailValidator
:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L}
provides this flexibility. We also use \\s',.-
to allow spaces, apostrophes, commas, periods, and hyphens in our name fields; the hyphen is placed last so that it is read as a literal character rather than as part of a range. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    // \\p{L} matches a letter from any language
    String nameRegex = "^[\\p{L}\\s',.-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr.
are acceptable, but the 4
in 4th
is not. Additionally, the special characters in François
and Müller
are considered valid.
Java core tokenizers
StringTokenizer
was the first and most basic tokenizer and has been available since Java 1. It is not recommended for use in new development as the String
class's split
method is the recommended alternative. While StringTokenizer does provide a speed advantage for files with narrowly defined and set delimiters, it is less flexible than other tokenizer options. The following is a simple implementation of the StringTokenizer
class that splits a string on spaces:
StringTokenizer tokenizer = new StringTokenizer(dirtyText," "); while(tokenizer.hasMoreTokens()){ out.print(tokenizer.nextToken() + " "); }
When we set the dirtyText
variable to hold our text from Moby Dick, shown previously, we get the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely...
StreamTokenizer
is another core Java tokenizer. StreamTokenizer
grants more information about the tokens retrieved, and allows the user to specify data types to parse, but is considered more difficult to use than StringTokenizer
or the split
method. The String
class split
method is the simplest way to split strings up based on a delimiter, but it does not provide a way to parse the split strings, and the delimiter must be expressed as a single regular expression for the entire string. For these reasons, it is not a true tokenizer, but it can be useful for data cleaning.
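To make the StreamTokenizer description above more concrete, here is a minimal sketch that reports the type of each token it reads. It relies on the tokenizer's default settings, under which numbers are parsed into the nval field and words into the sval field. The class name and the describeTokens helper are our own and are used only for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSketch {
    // Labels each token as a word, a number, or a single ordinary character.
    static List<String> describeTokens(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        List<String> tokens = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("NUMBER " + tokenizer.nval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add("WORD " + tokenizer.sval);
                } else {
                    tokens.add("CHAR " + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [WORD Ishmael, WORD sailed, NUMBER 3.0, WORD years]
        System.out.println(describeTokens("Ishmael sailed 3 years"));
    }
}
```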
The Scanner
class is designed to allow you to parse strings into different data types. We used it previously in the Handling CSV data section and we will address it again in the Removing stop words section.
Third-party tokenizers and libraries
Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer
. This class provides more advanced support than the standard StringTokenizer
class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer
:
StrTokenizer tokenizer = new StrTokenizer(text); while (tokenizer.hasNext()) { out.print(tokenizer.next() + " "); }
This operates in a similar fashion to StringTokenizer
and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.
When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...
We can modify our constructor as follows:
StrTokenizer tokenizer = new StrTokenizer(text,",");
The output for this implementation is:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse and nothing particular to interest me on shore I thought I would sail about a little and see the watery part of the world.
Notice how each line is split where commas existed in the original text. This delimiter can be a simple char, as we have shown, or a more complex StrMatcher
object.
Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner
class and the Splitter
class. Tokenization is accomplished in Guava using its Splitter
class's split
method. The following is a simple example:
Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); Iterable<String> words = simpleSplit.split(dirtyText); for(String token: words){ out.print(token); }
This splits each token on commas and produces output like our last example. We can modify the parameter of the on
method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory
interface. We implement a LingPipe IndoEuropeanTokenizerFactory
tokenizer in the Simple text cleaning section.
Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.
Simple text cleaning
We will use the string shown before from Moby Dick to demonstrate some of the basic String
class methods. Notice the use of the toLowerCase
and trim
methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll
method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
// Collapse runs of spaces: each pass replaces two spaces with one
while(dirtyText.contains("  ")){
    dirtyText = dirtyText.replaceAll("  ", " ");
}
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely - call me ishmael some years ago never mind how long precisely
Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String
array. The split
method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W
, which represents anything that is not a word character:
out.println(dirtyText); dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); for(String clean : cleanText){ out.print(clean + " "); }
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join
method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join
method joins every word in the array words
and inserts a space between each word:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = String.join(" ", words); out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join
method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner
class:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = Joiner.on(" ").skipNulls().join(words); out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner
object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList
using the Arrays
class's asList
method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String
class methods—and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
// Split the cleaned text on spaces so that each word becomes a list element
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    // Normalize each stop word so that it matches the cleaned text
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world] Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
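A file is only one way to hold stop words. As a minimal sketch of an alternative, the stop words can be kept in a HashSet, which makes each lookup fast, and the text can be filtered in a single pass. The class name, the removeStopWords helper, and the tiny stop word list used here are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFilterSketch {
    // Returns a new list containing only the words that are not stop words.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            // Normalize before the lookup so that And and AND match and
            if (!stopWords.contains(word.toLowerCase().trim())) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some"));
        List<String> words = Arrays.asList(
                "call", "me", "ishmael", "some", "years", "ago");
        // Prints [call, ishmael, years, ago]
        System.out.println(removeStopWords(words, stopWords));
    }
}
```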
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory
class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer
method uses a char
array, so we call toCharArray
against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to tokenize:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences from our previous example. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
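Splitting on single spaces only finds words that are surrounded by spaces, so a word followed by a comma or period is missed. As a sketch of one alternative, a regular expression with word boundaries (\\b) counts whole-word matches regardless of adjacent punctuation. The class and method names are our own and are used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {
    // Counts whole-word, case-insensitive occurrences of toFind in text.
    static int countWord(String text, String toFind) {
        // Pattern.quote treats toFind as literal text, not as a regex
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(toFind) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "on shore, I thought I would sail about a little";
        // Prints Found i 2 times in the text.
        System.out.println("Found i " + countWord(text, "i") + " times in the text.");
    }
}
```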
We also have the option to use the Scanner
class to search through an entire file. One helpful method is the findWithinHorizon
method. This uses a Scanner
to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner
will be searched by default:
dirtyText = dirtyText.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
Scanner textLine = new Scanner(dirtyText);
// A horizon of 0 places no limit on how far the scanner searches
out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
It can also be more efficient to search an entire file using a BufferedReader
. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader
object from our path and process the file line by line until readLine returns null, which signals the end of the file:
String path = "C://MobyDick.txt";
try {
    String textLine = "";
    int line = 0; // number of lines read so far
    toFind = toFind.toLowerCase().trim();
    BufferedReader textToClean = new BufferedReader(new FileReader(path));
    while((textLine = textToClean.readLine()) != null){
        line++;
        if(textLine.toLowerCase().trim().contains(toFind)){
            out.println("Found " + toFind + " in " + textLine);
        }
    }
    textToClean.close();
} catch (IOException ex) {
    // Handle exceptions
}
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
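The line-by-line search can be sketched in a slightly more reusable form. The version below is our own variation, not the code used in this chapter: it accepts any Reader (so it can be tried on a StringReader as easily as on a FileReader), uses try-with-resources so the reader is closed even if an exception occurs, and returns the numbers of the matching lines. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineSearchSketch {
    // Returns the 1-based numbers of the lines that contain toFind.
    static List<Integer> findLineNumbers(Reader source, String toFind) {
        String needle = toFind.toLowerCase().trim();
        List<Integer> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String textLine;
            int lineNumber = 0;
            while ((textLine = reader.readLine()) != null) {
                lineNumber++;
                if (textLine.toLowerCase().contains(needle)) {
                    hits.add(lineNumber);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael.\nSome years ago\nnever mind how long";
        // Prints [2]
        System.out.println(findLineNumbers(new StringReader(text), "years"));
    }
}
```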
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains
method. If we find the text, we call the replaceAll
method to modify our string:
text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if(text.contains(toFind)){
    // \\b restricts the match to whole words, so the i in ishmael is left alone
    text = text.replaceAll("\\b" + toFind + "\\b", replaceWith);
    out.println(text);
}
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world. call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely - Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
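Using only the standard library, one way to sketch whole-word replacement is with the regular expression word boundary \\b, which matches me at the end of a sentence but not inside some. The class and method names below are our own and are used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplaceSketch {
    // Replaces word only where it appears as a complete word.
    static String replaceWord(String text, String word, String replacement) {
        // quote and quoteReplacement keep both arguments literal
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, like me.";
        // Prints Call X Ishmael. Some years ago, like X.
        System.out.println(replaceWord(text, "me", "X"));
    }
}
```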
The StringUtils
class also provides a replacePattern
method that allows you to search for and replace text based upon a regular expression. In the following example, we replace all non-word characters, such as hyphens and commas, with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher
class. CharMatcher
not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.
In this example, we are going to use the replace
method to simply replace all instances of the word me
with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom
method and print our string again:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
// CharMatcher.whitespace() is the method form of the older WHITESPACE constant
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call Ishmael. So years ago- ... With double spaces removed: Call Ishmael. So years ago- ...
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, a single substituted zero shifts the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
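The averaging approach just described can be sketched as follows. We assume, purely for illustration, that a missing reading is marked with Double.NaN rather than zero, which sidesteps the question of whether zero is a legitimate value. The class and method names are our own.

```java
import java.util.Arrays;

class MeanImputationSketch {
    // Returns a copy of values in which each NaN is replaced by the mean
    // of the values that are present.
    static double[] imputeWithMean(double[] values) {
        double sum = 0;
        int present = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                present++;
            }
        }
        double[] result = Arrays.copyOf(values, values.length);
        if (present == 0) {
            return result; // nothing to average, so nothing can be imputed
        }
        double mean = sum / present;
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the first month's reading is missing
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("Imputed value: %.2f%n", filled[0]);
    }
}
```

With the temperatures used earlier, the eleven remaining values sum to 794, so the imputed value is 794/11, or roughly 72.18, and the overall average is no longer pulled toward zero.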
When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName, which we use to handle array values that may be null. We then loop through our array and call the Optional class's ofNullable method. This method wraps each value in an Optional, which is empty when the value is null. On the next line, we call the orElse method to either assign the value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
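As a brief sketch of some of those other methods, the following chains map and filter before orElse, so that blank names are treated the same way as null ones. The cleanName helper and the sample values are hypothetical and only serve the illustration.

```java
import java.util.Optional;

class OptionalSketch {

    // Trims a possibly-null name, treats blank names as missing,
    // and falls back to a placeholder. The placeholder text is illustrative.
    static String cleanName(String raw) {
        return Optional.ofNullable(raw)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  Bob ", null, "   "};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name));
        }
        // ifPresent runs its action only when a value exists, so nothing prints here
        Optional.ofNullable(nameList[2])
                .ifPresent(n -> System.out.println("Found " + n));
    }
}
```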
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list along with its last element:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums))); SortedSet<Integer> partNumsList; out.println("Original List: " + fullNumsList.toString() + " " + fullNumsList.last());
The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve all numbers from the first number in our set, 12, up to but not including 46:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); out.println("SubSet of List: " + partNumsList.toString() + " " + partNumsList.size());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList as in the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:
out.println("Original List: " + numsList.toString()); Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); Set<Integer> partNumsList = fullNumsList .stream() .skip(5) .collect(toCollection(TreeSet::new)); out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
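The skip method can also be combined with limit to keep a window from the middle of the sorted data. The following is a small sketch of that idea; the window helper and its parameter names are our own and not part of the earlier example.

```java
import java.util.Arrays;
import java.util.TreeSet;
import java.util.stream.Collectors;

class StreamWindowSketch {

    // Returns the elements at sorted positions [skipCount, skipCount + size).
    static TreeSet<Integer> window(TreeSet<Integer> source, long skipCount, long size) {
        return source.stream()
                .skip(skipCount)
                .limit(size)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        TreeSet<Integer> fullNumsList =
                new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        // Skips the two lowest numbers and keeps the next three: [34, 44, 46]
        System.out.println("Window: " + window(fullNumsList, 2, 3));
    }
}
```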
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analyzed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { br .lines() .filter(s -> !s.equals("")) .forEach(s -> out.println(s)); } catch (IOException ex) { // Handle exceptions }
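The example above keeps lines that contain only spaces and does not deal with header lines. A possible extension is sketched below; it assumes the header occupies a known number of lines at the top of the input, and it reads from a StringReader with made-up sample data so that it runs without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class BlankLineSketch {

    // Drops the first headerLines lines, then removes empty and whitespace-only lines.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(s -> !s.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            dataLines(br, 1).forEach(System.out::println);
        }
    }
}
```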
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
We can now call the sort method as we did previously:
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
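The same chaining style can mix sort directions. The sketch below orders the oldest dogs first and breaks ties alphabetically by name. The nested Dogs class here is only a minimal stand-in for the class assumed in the text, and OLDEST_FIRST is a name we introduce for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DogSortSketch {

    // Minimal stand-in for the Dogs class described in the text.
    static class Dogs {
        private final String name;
        private final int age;
        Dogs(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    // Oldest dogs first; ties on age are broken alphabetically by name.
    static final Comparator<Dogs> OLDEST_FIRST =
            Comparator.comparingInt(Dogs::getAge).reversed()
                      .thenComparing(Dogs::getName);

    public static void main(String[] args) {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        dogs.sort(OLDEST_FIRST);
        for (Dogs d : dogs) {
            System.out.println(d.getName() + " " + d.getAge());
        }
    }
}
```

Note that reversed applies to everything built before it in the chain, which is why it is called before thenComparing here.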
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
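As a sketch of how the same technique might be adapted to floating point data, the following hypothetical isValidDouble method returns a boolean instead of printing. Rejecting NaN and infinite values is an assumption of this sketch and may not suit every dataset.

```java
class NumericValidationSketch {

    // Returns true when the text parses as a finite double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // Double.parseDouble(null) would throw NullPointerException
        }
        try {
            double parsed = Double.parseDouble(toValidate.trim());
            // parseDouble accepts "NaN" and "Infinity"; this sketch rejects them
            return !Double.isNaN(parsed) && !Double.isInfinite(parsed);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"12.5", "Ishmael", "NaN"};
        for (String sample : samples) {
            System.out.println(sample + " valid double? " + isValidDouble(sample));
        }
    }
}
```

Returning a boolean rather than printing makes the check easier to reuse inside filters and loops.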
The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we format the parsed date back into a String and compare it with the original text to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
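Since Java 8, the java.time package offers another way to perform this kind of check. The sketch below is an alternative to the SimpleDateFormat approach, not a replacement prescribed by it: it hard-codes the MM/dd/uuuu pattern and uses strict resolution so that impossible dates such as February 30 are rejected. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateSketch {

    // "uuuu" is used instead of "yyyy" because strict resolution
    // needs an era to resolve the year-of-era field ("y").
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/uuuu")
                    .withResolverStyle(ResolverStyle.STRICT);

    // Returns true when the text is a real calendar date in MM/dd/uuuu form.
    static boolean isValidDate(String theDate) {
        try {
            LocalDate.parse(theDate, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"12/12/1982", "12/12/82", "02/30/1982", "Ishmael"};
        for (String sample : samples) {
            System.out.println(sample + " valid date? " + isValidDate(sample));
        }
    }
}
```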
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    // Keep the compiled pattern so it can be used to create the matcher
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
The JavaMail API also supports validating e-mail addresses. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also add \\s, a period, an apostrophe, a comma, and a hyphen to the character class to allow whitespace, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    // The hyphen is placed last in the character class so that it is read literally
    String nameRegex = "^[\\p{L}\\s.',-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
Third-party tokenizers and libraries
Apache Commons consists of sets of open source Java classes and methods. These provide reusable code that complements the standard Java APIs. One popular class included in the Commons is StrTokenizer. This class provides more advanced support than the standard StringTokenizer class, specifically more control and flexibility. The following is a simple implementation of the StrTokenizer:
StrTokenizer tokenizer = new StrTokenizer(text); while (tokenizer.hasNext()) { out.print(tokenizer.next() + " "); }
This operates in a similar fashion to StringTokenizer and by default parses tokens on spaces. The constructor can specify the delimiter as well as how to handle double quotes contained in data.
When we use the string from Moby Dick, shown previously, the first tokenizer implementation produces the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse...
We can modify our constructor as follows:
StrTokenizer tokenizer = new StrTokenizer(text,",");
The output for this implementation is:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse and nothing particular to interest me on shore I thought I would sail about a little and see the watery part of the world.
Notice how each line is split where commas existed in the original text. This delimiter can be a simple string, as we have shown, a single char, or a more complex StrMatcher object.
Google Guava is an open source set of utility Java classes and methods. The primary goal of Guava, as with many APIs, is to relieve the burden of writing basic Java utilities so developers can focus on business processes. We are going to talk about two main tools in Guava in this chapter: the Joiner class and the Splitter class. Tokenization is accomplished in Guava using its Splitter class's split method. The following is a simple example:
Splitter simpleSplit = Splitter.on(',').omitEmptyStrings().trimResults(); Iterable<String> words = simpleSplit.split(dirtyText); for(String token: words){ out.print(token); }
This splits each token on commas and produces output like our last example. We can modify the parameter of the on method to split on the character of our choosing. Notice the method chaining which allows us to omit empty strings and trim leading and trailing spaces. For these reasons, and other advanced capabilities, Google Guava is considered by some to be the best tokenizer available for Java.
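When adding a third-party library is not an option, similar behavior can be approximated with the standard library. The following sketch is our own and is not how Guava implements Splitter; it splits on commas, trims each token, and drops tokens that end up empty. The splitOnCommas name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SplitSketch {

    // Splits on commas, trims each token, and drops tokens that end up empty.
    static List<String> splitOnCommas(String text) {
        return Arrays.stream(text.split(","))
                .map(String::trim)
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String dirtyText = "Call me Ishmael. , Some years ago- ,, never mind ";
        for (String token : splitOnCommas(dirtyText)) {
            System.out.println(token);
        }
    }
}
```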
LingPipe is a linguistic toolkit available for language processing in Java. It provides more specialized support for text splitting with its TokenizerFactory interface. We implement a LingPipe IndoEuropeanTokenizerFactory tokenizer in the Simple text cleaning section.
Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyze.
Simple text cleaning
We will use the string shown before from Moby Dick to demonstrate some of the basic String class methods. Notice the use of the toLowerCase and trim methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
// Collapse any run of whitespace characters into a single space
dirtyText = dirtyText.replaceAll("\\s+", " ");
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely - call me ishmael some years ago never mind how long precisely
Our next example produces the same result but also uses a regular expression to split the text. In this case, we remove all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String array. The split method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W, which represents anything that is not a word character:
out.println(dirtyText); dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); for(String clean : cleanText){ out.print(clean + " "); }
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join method joins every word in the array words and inserts a space between each word:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = String.join(" ", words); out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner class:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = Joiner.on(" ").skipNulls().join(words); out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
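The split-and-join steps can also be written as a single stream pipeline with Collectors.joining. This is a sketch of one possible arrangement; the normalize method name is ours, and the extra filter for empty tokens is there because split can produce a leading empty string when the text begins with a delimiter.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class JoinSketch {

    // Lowercases the text, splits on runs of non-word characters or digits,
    // drops empty tokens, and rejoins the words with single spaces.
    static String normalize(String dirtyText) {
        return Arrays.stream(dirtyText.toLowerCase().trim().split("[\\W\\d]+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String dirtyText = "Call me Ishmael. Some years ago- never mind how long precisely -";
        System.out.println(normalize(dirtyText));
    }
}
```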
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
// Split the cleaned text on spaces so that each word becomes a separate element
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList to hold a list of stop words actually found in our text. This will allow us to use the ArrayList class removeAll method shortly. Next, we use our Scanner to read through our file of stop words. Notice how we also call the toLowerCase and trim methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains method to determine whether our text contains the given stop word. If so, we add it to our foundWords ArrayList. Once we have processed all the stop words, we call removeAll to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    // Normalize each stop word so that it matches the formatting of our text
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world] Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we obtain an instance of the TokenizerFactory interface. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer method uses a char array, so we call toCharArray against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to process:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences from our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
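For small stop word lists, the standard library version can also be written with a Set and a stream filter, which avoids building the intermediate list of found words. The following is a sketch only: the stop words are a short, made-up list held in memory rather than read from a file, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordSketch {

    // A tiny illustrative stop word list; a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "me", "some", "or", "no", "in", "my", "and", "to", "on",
            "i", "a", "the", "of"));

    // Keeps only the words that are not in STOP_WORDS, preserving their order.
    static List<String> removeStopWords(String cleanText) {
        return Arrays.stream(cleanText.toLowerCase().trim().split("\\s+"))
                .filter(word -> !word.isEmpty() && !STOP_WORDS.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String cleanText = "call me ishmael some years ago never mind how long precisely";
        System.out.println("Text without stop words: " + removeStopWords(cleanText));
    }
}
```

A HashSet lookup takes roughly constant time per word, whereas calling contains on an ArrayList scans the whole list each time.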
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind to the letter I. This produced the following output:
Found i 2 times in the text.
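Splitting on spaces and comparing with equals will miss a word that is directly followed by punctuation, such as shore followed by a comma. One possible alternative, sketched below, counts whole-word matches with a regular expression. It assumes toFind is a plain word made of word characters, and the countWord name is our own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {

    // Counts case-insensitive whole-word occurrences of toFind in text.
    static int countWord(String text, String toFind) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(toFind.trim()) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "nothing particular to interest me on shore, "
                + "I thought I would sail about a little";
        System.out.println("Found i " + countWord(text, "i") + " times in the text.");
        System.out.println("Found shore " + countWord(text, "shore") + " times in the text.");
    }
}
```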
We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to search the input up to a given horizon. If zero is used for the second parameter, as shown next, the search is unbounded and the entire input will be searched:
dirtyText = dirtyText.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
Scanner textLine = new Scanner(dirtyText);
// A horizon of 0 places no limit on how far the search may look ahead
out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file until readLine returns null at the end of the file:
String path = "C:// MobyDick.txt"; try { String textLine = ""; toFind = toFind.toLowerCase().trim(); BufferedReader textToClean = new BufferedReader( new FileReader(path)); while((textLine = textToClean.readLine()) != null){ line++; if(textLine.toLowerCase().trim().contains(toFind)){ out.println("Found " + toFind + " in " + textLine); } } textToClean.close(); } catch (IOException ex) { // Handle exceptions }
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
Finding and replacing text
We often want not only to find text but also to replace it with something else. We begin our next example much like the previous ones, by normalizing our text and the text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll interprets its first argument as a regular expression, we surround the search term with \\b word boundaries so that only whole words are replaced:
text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if(text.contains(toFind)){
    text = text.replaceAll("\\b" + toFind + "\\b", replaceWith);
    out.println(text);
}
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely - Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
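Another way to limit the replacement to whole words is to use a regular expression with word boundaries, which also handles a word that sits next to punctuation. The following is a minimal sketch using only the standard String, Pattern, and Matcher classes; the class name WholeWordReplace and the method name replaceWord are hypothetical names chosen for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {
    // Replaces word only where it stands alone, not inside longer words.
    static String replaceWord(String text, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        // quoteReplacement keeps any $ or \ in the replacement literal
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, a whale found me.";
        System.out.println(replaceWord(text, "me", "X"));
    }
}
```

With this approach, Some is left intact while me is replaced both in the middle and at the end of a sentence.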
The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, the expression \\W\\s matches any non-word character, such as a hyphen or period, that is followed by a whitespace character, and we replace each such pair with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.
In this example, we are going to use the replace method of the String class to replace every occurrence of the character sequence me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again. Recent Guava releases supply the whitespace matcher through the static CharMatcher.whitespace method, whereas older releases exposed it as the CharMatcher.WHITESPACE constant:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that although this change may seem rather minor, a single substituted zero lowers the computed average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
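The averaging approach just described can be sketched as follows. This is only one way to do it, under assumptions of our own: missing readings are marked with Double.NaN rather than zero, so that a genuine zero remains a valid value, and the names MeanImputation and imputeWithMean are hypothetical:

```java
import java.util.Arrays;

class MeanImputation {
    // Returns a copy of values in which each NaN is replaced by
    // the mean of the values that are present.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // every value missing: nothing to impute from
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Same temperatures as before, with the first month marked as missing
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("Imputed value for the first month: %.2f%n", filled[0]);
    }
}
```

Because the substituted value equals the mean of the eleven known months, the overall average is unchanged by the imputation, which is the main appeal of this technique.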
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName to hold the name we will actually print out. We also declared an Optional<String> variable called tempName, which we use to handle array values that may be null. We then loop through our array and, for each element, call the Optional class's ofNullable method. This method wraps the value in an Optional, which is empty if the value is null. On the next line, we call the orElse method to either assign the value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
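As a brief illustration of those other methods, the following sketch chains map and filter before orElse, so that names consisting only of whitespace are treated like missing names. The class name NameCleaner and the method name cleanName are our own hypothetical choices:

```java
import java.util.Optional;

class NameCleaner {
    // Returns the trimmed name, or DEFAULT when the name is null or blank.
    static String cleanName(String raw) {
        return Optional.ofNullable(raw)
                .map(String::trim)           // skipped when the Optional is empty
                .filter(s -> !s.isEmpty())   // a blank name becomes an empty Optional
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  Bob ", "   ", null, "Betsy"};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name));
        }
    }
}
```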
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then declare a SortedSet variable to hold the subset retrieved from the set. Next, we print out our original set:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums))); SortedSet<Integer> partNumsList; out.println("Original List: " + fullNumsList.toString() + " " + fullNumsList.last());
The subSet method takes two parameters, which specify the range of values we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve all numbers from the smallest element of the set, 12, up to but not including 46:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); out.println("SubSet of List: " + partNumsList.toString() + " " + partNumsList.size());
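Our output follows. The first line ends with the largest element of the full set, and the second with the size of the subset:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4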
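When the upper bound should be included as well, the NavigableSet interface, which TreeSet also implements, offers a four-argument subSet method with explicit inclusive flags, along with headSet and tailSet for open-ended ranges. The sketch below shows these calls; the class name InclusiveSubset and the helper method between are hypothetical:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class InclusiveSubset {
    // Returns the elements from low to high, including both bounds.
    static NavigableSet<Integer> between(NavigableSet<Integer> set, int low, int high) {
        return set.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums =
                new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        System.out.println("12 to 46 inclusive: " + between(nums, 12, 46));
        System.out.println("Below 46: " + nums.headSet(46, false));
        System.out.println("52 and above: " + nums.tailSet(52, true));
    }
}
```

Keep in mind that these methods return views backed by the original set, so removing an element from a view also removes it from the full set.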
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time held in a list named numsList, and specify how many elements to skip with the skip method. We will also use the collect method, together with a static import of Collectors.toCollection, to create a new Set to hold the remaining elements:
out.println("Original List: " + numsList.toString()); Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); Set<Integer> partNumsList = fullNumsList .stream() .skip(5) .collect(toCollection(TreeSet::new)); out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { br .lines() .filter(s -> !s.equals("")) .forEach(s -> out.println(s)); } catch (IOException ex) { // Handle exceptions }
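The same stream can also drop header lines, and the blank-line test can be widened to cover lines that contain only spaces or tabs. The following sketch makes several assumptions for illustration: the header occupies a known number of lines at the top of the input, the data is read from a StringReader rather than a file so that the example is self-contained, and the names LineCleaner and dataLines are hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class LineCleaner {
    // Skips the first headerLines lines, then drops blank lines.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(s -> !s.trim().isEmpty()) // also removes whitespace-only lines
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for a file
        String data = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(data))) {
            dataLines(br, 1).forEach(System.out::println);
        }
    }
}
```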
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method of the Integer class, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
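We can now pass compareInts to the sort method of the Collections class. Here, numsList is the list of integers whose initialization is shown with the Collections example later in this section: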
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List objects and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is expected:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
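The text assumes a Dogs class without showing it. The sketch below supplies one hypothetical version of that class, together with a variation that sorts by age in descending order and breaks ties alphabetically by name. The nested Dogs class, the sampleDogs helper, and the namesOldestFirst method are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DogSortSketch {
    // Hypothetical stand-in for the Dogs class assumed by the text
    static class Dogs {
        private final String name;
        private final int age;
        Dogs(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dogs> sampleDogs() {
        return Arrays.asList(new Dogs("Zoey", 8), new Dogs("Roxie", 10),
                new Dogs("Kylie", 7), new Dogs("Shorty", 14),
                new Dogs("Ginger", 7), new Dogs("Penny", 7));
    }

    // Oldest first; dogs of the same age are ordered alphabetically by name.
    static List<String> namesOldestFirst(List<Dogs> dogs) {
        List<Dogs> sorted = new ArrayList<>(dogs); // leave the caller's list untouched
        sorted.sort(Comparator.comparingInt(Dogs::getAge).reversed()
                .thenComparing(Dogs::getName));
        return sorted.stream().map(Dogs::getName).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesOldestFirst(sampleDogs()));
    }
}
```

Note that reversed applies only to the comparator built so far, so the age comparison is reversed while the name comparison added afterwards remains ascending.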
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
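To show how the same technique might be adapted to floating point data, here is a minimal sketch built on Double.parseDouble. The class name NumberValidation and the method name isValidDouble are hypothetical. The sketch returns a boolean instead of printing, and it checks for null first because parseDouble throws a NullPointerException, rather than a NumberFormatException, when passed null:

```java
class NumberValidation {
    // Returns true if the text can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble(null) would throw NullPointerException
        }
        try {
            Double.parseDouble(toValidate.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("12.5 valid? " + isValidDouble("12.5"));
        System.out.println("Ishmael valid? " + isValidDouble("Ishmael"));
    }
}
```

Be aware that parseDouble is fairly permissive: it also accepts scientific notation such as 1e3 and the special strings NaN and Infinity, which may or may not be acceptable for a given dataset.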
The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
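We again use the following method calls to test our method. Because this version returns its message rather than printing it, we pass the result to println:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));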
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports the date as invalid. If the String can be parsed, we format the resulting Date back into text and compare it with the original input; because SimpleDateFormat parses leniently by default, only an exact round trip confirms that the input used the required format:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
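As an alternative to the SimpleDateFormat round trip, the java.time package introduced in Java 8 can perform a strict parse directly. The sketch below swaps in DateTimeFormatter with ResolverStyle.STRICT; it is not the method used above. The names StrictDateValidation and isValidDate are hypothetical, and the pattern uses uuuu for the year because strict resolution of yyyy would also require an era:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateValidation {
    // Returns true only if theDate matches the pattern and is a real calendar date.
    static boolean isValidDate(String theDate, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(theDate, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        for (String candidate : new String[]{"12/12/1982", "12/12/82", "02/30/2020", "Ishmael"}) {
            System.out.println(candidate + " valid? " + isValidDate(candidate, dateFormat));
        }
    }
}
```

Strict resolution also rejects impossible dates such as February 30, which a purely format-based check could overlook.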
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
The JavaMail API also provides support for validating e-mail addresses. In this example, we use the InternetAddress class, found in the javax.mail.internet package, to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate
method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator
class. This class's isValid
method examines an e-mail address and determines whether it is valid or not. Our validateEmail
method shown previously is modified as follows to use EmailValidator
:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, a period, a comma, an apostrophe, and a hyphen in the character class to allow spaces and common punctuation in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    String nameRegex = "^[\\p{L}\\s.,'-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr.
are acceptable, but the 4
in 4th
is not. Additionally, the special characters in François
and Müller
are considered valid.
Transforming data into a usable form
Data often needs to be cleaned once it has been acquired. Datasets are often inconsistent, are missing information, and contain extraneous information. In this section, we will examine some simple ways to transform text data to make it more useful and easier to analyse.
Simple text cleaning
We will use the string shown before from Moby Dick to demonstrate some of the basic String
class methods. Notice the use of the toLowerCase
and trim
methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll
method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:
out.println(dirtyText);
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
dirtyText = dirtyText.replaceAll("\\s+", " ");
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely - call me ishmael some years ago never mind how long precisely
Our next example produces the same result but relies on the split method to separate the words. In this case, we remove all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String
array. The split
method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W
, which represents anything that is not a word character:
out.println(dirtyText); dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); for(String clean : cleanText){ out.print(clean + " "); }
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join
method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join
method joins every word in the array words
and inserts a space between each word:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = String.join(" ", words); out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join
method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner
class:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = Joiner.on(" ").skipNulls().join(words); out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
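One detail to watch with the split-then-join approach is that split returns an empty first element when the text begins with a delimiter, which leaves a leading space after joining. A third option, sketched below with the standard Collectors.joining collector, filters out empty strings before joining. The class name JoinSketch, the method name cleanAndJoin, and the sample sentence are hypothetical:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class JoinSketch {
    // Lowercases the text, splits on non-word characters and digits,
    // drops empty fragments, and rejoins the words with single spaces.
    static String cleanAndJoin(String dirtyText) {
        String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
        return Arrays.stream(words)
                .filter(w -> !w.isEmpty()) // split yields "" when the text starts with a delimiter
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(cleanAndJoin("- Call me, Ishmael! 42 years"));
    }
}
```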
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner
object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList
using the Arrays
class's asList
method, after splitting the text into individual words. We will assume here that the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world] Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
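The same idea can be expressed without an intermediate foundWords list by holding the stop words in a Set and filtering the words as they are streamed. The following is a minimal sketch rather than a definitive implementation: the class name StopWordFilter, the method removeStopWords, and the short in-memory stop word list are our own assumptions for illustration, standing in for the stop words file used previously.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordFilter {

    // Returns the words of the text that are not stop words, keeping order and repeats.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        return Arrays.stream(text.toLowerCase().trim().split("\\s+"))
                .filter(word -> !word.isEmpty())           // empty input yields one empty token
                .filter(word -> !stopWords.contains(word)) // Set lookup instead of a list scan
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stop word list; in practice it would be read from a file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "or", "and"));
        // Prints [call, ishmael, years, ago]
        System.out.println(removeStopWords("call me ishmael some years ago", stopWords));
    }
}
```

Using a Set keeps each lookup cheap when the stop word list is long, and the text only needs to be traversed once.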
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory
class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer
method uses a char
array, so we call toCharArray
against our text. The second parameter specifies where to begin within the character array, and the last parameter specifies how many characters to tokenize:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences from our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
Finding words in text
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
We also have the option to use the Scanner
class to search through an entire file. One helpful method is the findWithinHorizon
method. This uses a Scanner
to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner
will be searched by default:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); Scanner textLine = new Scanner(dirtyText); out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
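When we do need to know where, and how many times, a word occurs, the Pattern and Matcher classes can report the position of each match. The sketch below is one possible approach under our own assumptions: the class name WordLocator and the method findPositions are hypothetical, and we chose to match whole words only and to ignore case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordLocator {

    // Returns the start index of every whole-word, case-insensitive match of toFind.
    static List<Integer> findPositions(String text, String toFind) {
        // Pattern.quote treats toFind as literal text; \\b restricts matches to whole words.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(toFind) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "I thought I would sail about a little";
        List<Integer> positions = findPositions(text, "i");
        // Prints: Found 2 times at [0, 10]
        System.out.println("Found " + positions.size() + " times at " + positions);
    }
}
```

The letter i inside sail and little is not counted, because the word boundaries require the match to stand alone.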
It can also be more efficient to search an entire file using a BufferedReader
. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader
object from our path and process the file until readLine returns null, which signals the end of the file:
String path = "C://MobyDick.txt"; try { String textLine = ""; int line = 0; toFind = toFind.toLowerCase().trim(); BufferedReader textToClean = new BufferedReader( new FileReader(path)); while((textLine = textToClean.readLine()) != null){ line++; if(textLine.toLowerCase().trim().contains(toFind)){ out.println("Found " + toFind + " in " + textLine); } } textToClean.close(); } catch (IOException ex) { // Handle exceptions }
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains
method. If we find the text, we call the replaceAll
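method to modify our string. Because replaceAll treats its first argument as a regular expression, we surround toFind with the word boundary \\b so that only whole words are replaced:
text = text.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); out.println(text); if(text.contains(toFind)){ text = text.replaceAll("\\b" + toFind + "\\b", replaceWith); out.println(text); }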
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world. call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely - Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
The StringUtils
class also provides a replacePattern
method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as a period, hyphen, or comma at the end of a word, with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher
class. CharMatcher
not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.
In this example, we are going to use the replace
method to simply replace all instances of the word me
with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom
method and print our string again:
text = text.replace("me", " "); out.println("With double spaces: " + text); String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' '); out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call Ishmael. So years ago- ... With double spaces removed: Call Ishmael. So years ago- ...
Data imputation
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analyzed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, a single substituted zero lowered the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
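The averaging approach just described can be sketched as follows. This is an illustration under our own assumptions, not the only way to impute: we mark missing entries with Double.NaN rather than zero so that a legitimate zero reading is never mistaken for a gap, and the class name MeanImputation and method imputeWithMean are hypothetical.

```java
import java.util.Arrays;

final class MeanImputation {

    // Replaces every NaN entry with the mean of the non-missing entries.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // every value missing: nothing to impute with
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Same temperatures as before, with the first month marked as missing.
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        // Prints: Imputed value: 72.18
        System.out.printf("Imputed value: %.2f%n", filled[0]);
        // Prints: The average temperature is 72.18
        System.out.printf("The average temperature is %.2f%n",
                Arrays.stream(filled).average().orElse(Double.NaN));
    }
}
```

Substituting the mean leaves the average of the dataset unchanged, although it does reduce the apparent variation in the data, which is worth keeping in mind for later analysis.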
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName
to hold the name we will actually print out. We also created an instance of the Optional
class called tempName
. We will use this to handle values in the array that may be null. We then loop through our array and call the Optional
class ofNullable
method. This method wraps the value in an Optional, returning an empty Optional if the value is null. On the next line, we call the orElse
method to either assign a value from the array to useName
or, if the element is null, assign DEFAULT
. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
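As a brief illustration of those other methods, the map and filter methods can be chained before orElse so that cleaning and defaulting happen in one expression. This is a minimal sketch; the class name OptionalNames, the method cleanName, and the decision to treat blank names as missing are our own assumptions.

```java
import java.util.Optional;

final class OptionalNames {

    // Trims a possibly-null name, treats blank names as missing, and falls back to a default.
    static String cleanName(String rawName) {
        return Optional.ofNullable(rawName)
                .map(String::trim)               // skipped when the Optional is empty
                .filter(name -> !name.isEmpty()) // a blank name becomes an empty Optional
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  Bob ", "", null};
        for (String name : nameList) {
            // Prints Amy, Bob, DEFAULT, and DEFAULT, each prefixed with "Name to use = "
            System.out.println("Name to use = " + cleanName(name));
        }
    }
}
```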
Subsetting data
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then create a new TreeSet
object to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums))); SortedSet<Integer> partNumsList; out.println("Original List: " + fullNumsList.toString());
The subSet
method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12
and 46
:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); out.println("SubSet of List: " + partNumsList.toString());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] SubSet of List: [12, 14, 34, 44]
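Because the upper bound of subSet is exclusive, 46 itself is missing from the result. When both ends of the range are wanted, the NavigableSet interface, which TreeSet also implements, offers a four-argument subSet with explicit inclusive flags, along with headSet and tailSet for open-ended ranges. The sketch below uses the same numbers; the class name RangeSubsets and the method between are hypothetical names of our own.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class RangeSubsets {

    // Returns the values from low to high, with both bounds included.
    static NavigableSet<Integer> between(NavigableSet<Integer> values, int low, int high) {
        return values.subSet(low, true, high, true);
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums =
                new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        System.out.println(between(nums, 12, 46)); // [12, 14, 34, 44, 46]
        System.out.println(nums.headSet(46));      // everything below 46: [12, 14, 34, 44]
        System.out.println(nums.tailSet(52));      // 52 and above: [52, 87, 123]
    }
}
```

Keep in mind that these methods return views backed by the original set, so removing an element from the subset also removes it from the full set.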
Another option is to use the stream
method in conjunction with the skip
method. The stream
method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList
as in the previous example, but this time we will specify how many elements to skip with the skip
method. We will also use the collect
method to create a new Set
to hold the new elements:
out.println("Original List: " + numsList.toString()); Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); Set<Integer> partNumsList = fullNumsList .stream() .skip(5) .collect(toCollection(TreeSet::new)); out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { br .lines() .filter(s -> !s.trim().isEmpty()) .forEach(s -> out.println(s)); } catch (IOException ex) { // Handle exceptions }
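Header lines can be dropped in the same pipeline by placing a call to skip before the filter. The following sketch assumes the number of header lines is known in advance; the class name LineFilter, the method dataLines, and the sample data are hypothetical, and a StringReader stands in for the file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

final class LineFilter {

    // Drops the first headerLines lines, then any empty or whitespace-only lines.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample: one header line, then data mixed with blank lines.
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            System.out.println(dataLines(br, 1)); // [Zoey,8, Roxie,10]
        }
    }
}
```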
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator
variable compareInts
. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare
method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
We can now call the sort
method as we did previously:
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
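One detail to watch when sorting text that has not been normalized is that compareTo orders strings by character value, so every uppercase letter sorts before every lowercase letter. The sketch below contrasts that ordering with the String class's CASE_INSENSITIVE_ORDER comparator. The class name CaseInsensitiveSort, the method sortIgnoringCase, and the mixed-case word list are our own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CaseInsensitiveSort {

    // Returns a sorted copy of the words, ignoring differences in case.
    static List<String> sortIgnoringCase(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "Dog", "house", "Boat");
        List<String> natural = new ArrayList<>(words);
        natural.sort((first, second) -> first.compareTo(second));
        System.out.println("compareTo order: " + natural);                 // [Boat, Dog, cat, house]
        System.out.println("Ignoring case: " + sortIgnoringCase(words));   // [Boat, cat, Dog, house]
    }
}
```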
In our next example, we are going to use the Collections
class to perform basic sorting on String
and integer data. For this example, wordsList
and numsList
are both List
instances and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator
interface. We will continue to use our numsList
and assume that no sorting has occurred yet. First we create two objects that implement the Comparator
interface. The sort
method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare
is a Java 8 method reference. This can be used wherever a lambda expression is used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs
class that contains two properties, name
and age
, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList
and then printing the name and age of each Dogs
object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs
class. We first call comparing
followed by thenComparing
to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs
objects sorted first by Name
and then by Age
:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
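The class used in these examples is not shown, so the following is only a guess at a minimal version: a sketch with the two properties and accessor methods that the sorting code relies on. The sortOldestFirst method is a hypothetical addition of our own that shows how reversed can be applied to one level of a chained comparison.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class Dogs {
    private final String name;
    private final int age;

    Dogs(String name, int age) {
        this.name = name;
        this.age = age;
    }

    String getName() { return name; }
    int getAge() { return age; }

    // Sorts by age, oldest first, and breaks ties alphabetically by name.
    static void sortOldestFirst(List<Dogs> dogs) {
        // comparingInt avoids boxing the int age; reversed() applies to the age comparison only.
        dogs.sort(Comparator.comparingInt(Dogs::getAge).reversed()
                .thenComparing(Dogs::getName));
    }

    public static void main(String[] args) {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Ginger", 7));
        sortOldestFirst(dogs);
        for (Dogs d : dogs) {
            System.out.println(d.getName() + " " + d.getAge()); // Zoey 8, Ginger 7, Kylie 7
        }
    }
}
```

Because reversed is called before thenComparing, only the age ordering is flipped and the names within each age group still appear in ascending order.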
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt
method. This technique is easily modified for the other major data types supported in the standard Java library, including Float
and Double
.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){ try{ int validInt = Integer.parseInt(toValidate); out.println(validInt + " is a valid integer"); }catch(NumberFormatException e){ out.println(toValidate + " is not a valid integer"); } }
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
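Adapting this technique to floating point data requires a little care, because the parseDouble method also accepts the strings NaN and Infinity, which are rarely wanted in a dataset. The following sketch returns a boolean instead of printing; the class name NumberValidation, the method isValidDouble, and the choice to reject non-finite values are our own assumptions.

```java
final class NumberValidation {

    // Returns true if the text parses as a finite double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw a NullPointerException
        }
        try {
            double value = Double.parseDouble(toValidate.trim());
            // parseDouble also accepts "NaN" and "Infinity"; reject them here.
            return !Double.isNaN(value) && !Double.isInfinite(value);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.34"));   // true
        System.out.println(isValidDouble("Ishmael")); // false
        System.out.println(isValidDouble("NaN"));     // false
    }
}
```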
The Apache Commons Validator library contains an IntegerValidator
class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator
methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
out.println(validateInt("1234")); out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator
class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number
objects to Integer
objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
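The java.time package introduced in Java 8 offers another way to perform the same check. The sketch below is an alternative to the SimpleDateFormat approach, not a replacement prescribed by it: the class name DateValidation and the method isValidDate are hypothetical, and we assume a strict resolver is wanted so that impossible dates such as February 30 are rejected. Note that strict resolution expects the pattern letter u for the year rather than y.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

final class DateValidation {

    // Returns true only if the text is a real calendar date in exactly the given pattern.
    static boolean isValidDate(String theDate, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(theDate, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu"; // 'u' (year) rather than 'y' (year-of-era) in strict mode
        System.out.println(isValidDate("12/12/1982", dateFormat)); // true
        System.out.println(isValidDate("12/12/82", dateFormat));   // false
        System.out.println(isValidDate("02/30/1982", dateFormat)); // false
    }
}
```

With this approach there is no need to format the parsed date again and compare it with the input, because the strict formatter rejects text that does not match the pattern exactly.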
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) { String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; Pattern pattern = Pattern.compile(emailRegex); Matcher matcher = pattern.matcher(email); if(matcher.matches()){ return email + " is a valid email address"; }else{ return email + " is not a valid email address"; } }
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
The JavaMail API (the javax.mail package), which is a separate library rather than part of the core Java platform, can validate e-mail addresses as well. In this example, we use the InternetAddress
class to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate
method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator
class. This class's isValid
method examines an e-mail address and determines whether it is valid or not. Our validateEmail
method shown previously is modified as follows to use EmailValidator
:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
validateZip("12345"); validateZip("12345-6789"); validateZip("123");
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L}
provides this flexibility. We also use \\s',.-
to allow spaces, apostrophes, commas, periods, and hyphens in our name fields. The hyphen is placed last in the character class so that it is read as a literal character rather than as part of a range. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){ String nameRegex = "^[\\p{L}\\s',.-]+$"; Pattern pattern = Pattern.compile(nameRegex); Matcher matcher = pattern.matcher(name); if(matcher.matches()){ out.println(name + " is a valid name"); }else{ out.println(name + " is not a valid name"); } }
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr.
are acceptable, but the 4
in 4th
is not. Additionally, the special characters in François
and Müller
are considered valid.
Simple text cleaning
We will use the string shown before from Moby Dick to demonstrate some of the basic String
class methods. Notice the use of the toLowerCase
and trim
methods. Datasets often have non-standard casing and extra leading or trailing spaces. These methods ensure uniformity of our dataset. We also use the replaceAll
method twice. In the first instance, we use a regular expression to replace all numbers and anything that is not a word or whitespace character with a single space. The second instance replaces all back-to-back whitespace characters with a single space:
out.println(dirtyText);
// Replace digits and punctuation with a single space
dirtyText = dirtyText.toLowerCase().replaceAll("[\\d[^\\w\\s]]+", " ");
dirtyText = dirtyText.trim();
// Collapse each run of back-to-back whitespace into a single space
dirtyText = dirtyText.replaceAll("\\s{2,}", " ");
out.println(dirtyText);
When executed, the code produces the following output, truncated:
Call me Ishmael. Some years ago- never mind how long precisely - call me ishmael some years ago never mind how long precisely
Our next example produces the same result but approaches the problem with regular expressions. In this case, we replace all of the numbers and other special characters first. Then we use method chaining to standardize our casing, remove leading and trailing spaces, and split our words into a String
array. The split
method allows you to break apart text on a given delimiter. In this case, we chose to use the regular expression \\W
, which represents anything that is not a word character:
out.println(dirtyText); dirtyText = dirtyText.replaceAll("[\\d[^\\w\\s]]+", ""); String[] cleanText = dirtyText.toLowerCase().trim().split("[\\W]+"); for(String clean : cleanText){ out.print(clean + " "); }
This code produces the same output as shown previously.
Although arrays are useful for many applications, it is often important to recombine text after cleaning. In the next example, we employ the join
method to combine our words once we have cleaned them. We use the same chained methods as shown previously to clean and split our text. The join
method joins every word in the array words
and inserts a space between each word:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = String.join(" ", words); out.println(cleanText);
Again, this code produces the same output as shown previously. An alternate version of the join
method is available using Google Guava. Here is a simple implementation of the same process we used before, but using the Guava Joiner
class:
out.println(dirtyText); String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+"); String cleanText = Joiner.on(" ").skipNulls().join(words); out.println(cleanText);
This version provides additional options, including skipping nulls, as shown before. The output remains the same.
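The split-and-join steps above can also be expressed as a single stream pipeline. The following is a minimal sketch rather than a replacement for the earlier examples; the class name TextCleaner and the method clean are hypothetical names chosen for illustration. It filters out the empty token that split produces when the text begins with a delimiter, so the result never starts with a stray space:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class TextCleaner {

    // Lower-cases the text, splits it on runs of non-word characters
    // and digits, and joins the remaining words with single spaces.
    static String clean(String dirtyText) {
        return Arrays.stream(dirtyText.toLowerCase().split("[\\W\\d]+"))
                .filter(word -> !word.isEmpty()) // drop the empty leading token
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String dirtyText = "  Call me Ishmael. Some years ago- never mind 42 ";
        System.out.println(clean(dirtyText));
    }
}
```

Because the filter removes empty tokens, no separate call to trim is needed in this version.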
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner
object to read in our stop words. Then we split the text we wish to transform into individual words and store them in an ArrayList
using the Arrays
class's asList
method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String
class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
// Split the cleaned text on spaces so that each word becomes one list element
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    // Normalize each stop word so it matches the formatting of our text
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world] Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
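The loop above searches the word list once for every stop word in the file. As an alternative, the following minimal sketch loads the stop words into a HashSet and filters the text in a single pass. The class name StopWordFilter, the method removeStopWords, and the hard-coded stop word list are assumptions made for illustration; in practice the set would be filled from a file such as the one used above:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordFilter {

    // Returns the words of 'text' that do not appear in 'stopWords'.
    // The text is lower-cased here; 'stopWords' is assumed to hold
    // lower-case entries already.
    static List<String> removeStopWords(String text, Set<String> stopWords) {
        return Arrays.stream(text.toLowerCase().trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .filter(word -> !stopWords.contains(word))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stop word list, hard-coded only for illustration
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "and", "the"));
        String text = "call me ishmael some years ago";
        System.out.println("Text without stop words: " + removeStopWords(text, stopWords));
    }
}
```

A set lookup does not depend on the length of the stop word list, which can matter when the list is long.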
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory
class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer
method uses a char
array, so we call toCharArray
against our text. The second parameter specifies where to begin within the array, and the last parameter specifies how many characters to process:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice how this output differs from our previous examples. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
We also have the option to use the Scanner
class to search through an entire file. One helpful method is the findWithinHorizon
method. This uses a Scanner
to parse the text up to a given horizon specification. If zero is used for the second parameter, as shown next, the entire Scanner
will be searched by default:
dirtyText = dirtyText.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
Scanner textLine = new Scanner(dirtyText);
// A horizon of 0 places no bound on how far the Scanner searches
out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
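When we do need to know where and how many times a word occurs, one option is the Matcher class from the standard library. The following is a minimal sketch; the class name WordLocator and the method findPositions are hypothetical. It uses word boundaries so that the letter i inside a longer word is not counted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordLocator {

    // Returns the starting index of every whole-word occurrence of 'word' in 'text'.
    static List<Integer> findPositions(String text, String word) {
        // \b anchors the match at word boundaries; Pattern.quote treats 'word' literally
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        List<Integer> positions = new ArrayList<>();
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "On shore, I thought I would sail";
        List<Integer> positions = findPositions(text, "i");
        System.out.println("Found " + positions.size() + " times at " + positions);
    }
}
```

The size of the returned list gives the count, and each entry is an index that can be passed to substring or similar methods.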
It can also be more efficient to search an entire file using a BufferedReader
. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader
object from our path and process our file line by line until the readLine method returns null, which signals the end of the file:
String path = "C://MobyDick.txt";
try {
    String textLine = "";
    int line = 0; // number of lines read so far
    toFind = toFind.toLowerCase().trim();
    BufferedReader textToClean = new BufferedReader(new FileReader(path));
    while((textLine = textToClean.readLine()) != null){
        line++;
        if(textLine.toLowerCase().trim().contains(toFind)){
            out.println("Found " + toFind + " in " + textLine);
        }
    }
    textToClean.close();
} catch (IOException ex) {
    // Handle exceptions
}
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains
method. If we find the text, we call the replaceAll
method to modify our string:
text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if(text.contains(toFind)){
    // \b restricts the match to whole words, so the i inside ishmael is left alone
    text = text.replaceAll("\\b" + toFind + "\\b", replaceWith);
    out.println(text);
}
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world. call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely - Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me
has been replaced, even those instances contained within other words, such as some.
This can be avoided by adding spaces around me
, although this will ignore any instances where me is at the end of the sentence, like me. We will examine a better alternative using Google Guava in a moment.
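Another way to avoid altering words such as some is to use regular expression word boundaries with the standard replaceAll method. The following is a minimal sketch, not part of Apache Commons; the class name WholeWordReplace and the method replaceWord are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordReplace {

    // Replaces 'target' only where it appears as a complete word.
    static String replaceWord(String text, String target, String replacement) {
        String regex = "\\b" + Pattern.quote(target) + "\\b";
        // quoteReplacement keeps characters such as $ in 'replacement' literal
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, it pleased me.";
        System.out.println(replaceWord(text, "me", "X"));
    }
}
```

Because a boundary also exists between a letter and a punctuation mark, this version handles me at the end of a sentence as well.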
The StringUtils
class also provides a replacePattern
method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as the period after Ishmael or the hyphen after ago, with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher
class. CharMatcher
not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespaces from within a particular string.
In this example, we are going to use the replace
method to simply replace all instances of the word me
with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom
method and print our string again:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
// CharMatcher.whitespace() is the factory method that supersedes the older WHITESPACE constant
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call Ishmael. So years ago- ... With double spaces removed: Call Ishmael. So years ago- ...
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what the speaker is trying to convey.
Among statistical analysts, there are differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. As an example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, substituting a single zero lowered the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
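The approach just described can be sketched as follows. This is an illustration under stated assumptions rather than a general-purpose routine: the class name MeanImputation and the method imputeWithMean are hypothetical, and missing readings are marked with Double.NaN instead of zero so that a genuine temperature of zero is not mistaken for missing data:

```java
import java.util.Arrays;

class MeanImputation {

    // Replaces every NaN entry with the mean of the non-missing entries.
    // Returns a new array; the input array is left unchanged.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // every value missing: nothing to impute from
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        // Double.NaN marks the missing first reading (an assumed convention)
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("Imputed value: %.2f%n", filled[0]);
        System.out.printf("The average temperature is %.2f%n",
                Arrays.stream(filled).average().orElse(Double.NaN));
    }
}
```

Filling a gap with the mean of the remaining values leaves the overall average unchanged, which is why this substitution distorts the result less than inserting a zero.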
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName
to hold the name we will actually print out. We also created an instance of the Optional
class called tempName
. We will use this to handle values in the array that may be null. We then loop through our array and call the Optional
class ofNullable
method. This method wraps the value in an Optional, returning an empty Optional if the value is null. On the next line, we call the orElse
method to either assign a value from the array to useName
or, if the element is null, assign DEFAULT
. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
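As an illustration of some of these other methods, the following minimal sketch chains map and filter before orElse. The class name OptionalNames and the method normalize are hypothetical, and treating a whitespace-only name as missing is an assumption made for this example:

```java
import java.util.Optional;

class OptionalNames {

    // Trims and upper-cases a name, or returns "DEFAULT" when the
    // input is null or contains only whitespace.
    static String normalize(String name) {
        return Optional.ofNullable(name)
                .map(String::trim)              // skipped when the Optional is empty
                .filter(s -> !s.isEmpty())      // a blank name becomes an empty Optional
                .map(String::toUpperCase)
                .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", " Bob ", null, "   "};
        for (String name : nameList) {
            System.out.println("Name to use = " + normalize(name));
        }
    }
}
```

Each step is applied only when a value is present, so no explicit null checks are required.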
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then create a new TreeSet
object to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44};
// A TreeSet keeps its elements sorted in ascending order
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums)));
SortedSet<Integer> partNumsList;
out.println("Original List: " + fullNumsList.toString());
The subSet
method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results, while the second is excluded. In the example that follows, we want to retrieve a subset of all numbers between the lowest number in our set, 12,
and 46
:
// The lower bound is inclusive and the upper bound (46) is exclusive
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46);
out.println("SubSet of List: " + partNumsList.toString());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] SubSet of List: [12, 14, 34, 44]
Another option is to use the stream
method in conjunction with the skip
method. The stream
method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, held in a list named numsList
, but this time we will specify how many elements to skip with the skip
method. We will also use the collect
method to create a new Set
to hold the new elements:
out.println("Original List: " + numsList.toString()); Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); Set<Integer> partNumsList = fullNumsList .stream() .skip(5) .collect(toCollection(TreeSet::new)); out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { br .lines() .filter(s -> !s.equals("")) .forEach(s -> out.println(s)); } catch (IOException ex) { // Handle exceptions }
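The same pipeline can be extended to drop header lines and lines that contain only whitespace. The following is a minimal sketch; the class name LineFilter, the method dataLines, and the sample data are hypothetical, and it assumes the header occupies the very first lines of the input. A StringReader stands in for the file so that the example is self-contained:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class LineFilter {

    // Drops the first 'headerLines' lines, then every line that is empty
    // or contains only whitespace.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample standing in for a file (illustrative data)
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(sample))) {
            dataLines(br, 1).forEach(System.out::println);
        }
    }
}
```

Calling trim before isEmpty treats a line of spaces or tabs as blank, which the equals("") test would let through.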
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator
variable compareInts
. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare
method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
We can now call the sort
method as we did previously:
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections
class to perform basic sorting on String
and integer data. For this example, wordsList
and numsList
are both ArrayList
and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator
interface. We will continue to use our numsList
and assume that no sorting has occurred yet. First we create two objects that implement the Comparator
interface. The sort
method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare
is a Java 8 method reference. It can be used wherever a lambda expression is used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs
class that contains two properties, name
and age
, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList
and then printing the name and age of each Dogs object
:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs
class. We first call comparing
followed by thenComparing
to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs
objects sorted first by Name
and then by Age
:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
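The two comparisons can also run in different directions. The following minimal sketch sorts the oldest dogs first and breaks ties alphabetically by name. The nested Dogs class is only a stand-in for the class assumed in the text, and the names DogSortSketch, OLDEST_FIRST, and sortedNames are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DogSortSketch {

    // Minimal stand-in for the Dogs class assumed in the text
    static class Dogs {
        private final String name;
        private final int age;
        Dogs(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    // Oldest first; dogs of the same age are ordered alphabetically by name.
    // reversed() applies only to the age comparison that precedes it.
    static final Comparator<Dogs> OLDEST_FIRST =
            Comparator.comparingInt(Dogs::getAge).reversed()
                      .thenComparing(Dogs::getName);

    // Returns the names in sorted order without modifying the input list.
    static List<String> sortedNames(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        copy.sort(OLDEST_FIRST);
        List<String> names = new ArrayList<>();
        for (Dogs d : copy) {
            names.add(d.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<Dogs> dogs = Arrays.asList(new Dogs("Zoey", 8), new Dogs("Shorty", 14),
                new Dogs("Penny", 7), new Dogs("Ginger", 7));
        System.out.println(sortedNames(dogs));
    }
}
```

The position of reversed in the chain matters: placing it after thenComparing would reverse the combined ordering, including the name comparison.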
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt
method. This technique is easily modified for the other major data types supported in the standard Java library, including Float
and Double
.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        // parseInt throws NumberFormatException if the text is not an integer
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
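The same try-catch technique can be adapted for floating point data. The following is a minimal sketch; the class name NumberValidation and the method isValidDouble are hypothetical. It returns a boolean rather than printing, so the caller can decide how to react:

```java
class NumberValidation {

    // Returns true when 'toValidate' can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw NullPointerException
        }
        try {
            Double.parseDouble(toValidate.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"12.5", "1e3", "Ishmael"};
        for (String s : samples) {
            System.out.println(s + (isValidDouble(s) ? " is" : " is not") + " a valid double");
        }
    }
}
```

Note that parseDouble also accepts input such as 1e3 and NaN, which may or may not be desirable depending upon the dataset.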
The Apache Commons Validator library contains an IntegerValidator
class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator
methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator
class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number
objects to Integer
objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
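can be parsed to a date, we format the parsed date back into a string and compare the result with the original input to determine whether the date is valid. Because SimpleDateFormat parses leniently, this round trip rejects input such as 12/12/82, which parses successfully but is formatted back as 12/12/0082: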
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
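For comparison, the java.time classes added in Java 8 can perform a similar check without the round trip. The following minimal sketch is an alternative to the SimpleDateFormat method above, not a description of it; the class name StrictDateValidation and the method isValidDate are hypothetical, and the format is fixed to the month/day/year layout used in our test calls:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateValidation {

    // 'uuuu' (year) is used instead of 'yyyy' (year-of-era) because strict
    // resolution of 'yyyy' would also require an era field.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/uuuu")
                             .withResolverStyle(ResolverStyle.STRICT);

    // Returns true when 'theDate' is a real calendar date in MM/dd/uuuu form.
    static boolean isValidDate(String theDate) {
        try {
            LocalDate.parse(theDate, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"12/12/1982", "12/12/82", "02/30/2020", "Ishmael"};
        for (String s : samples) {
            System.out.println(s + (isValidDate(s) ? " is" : " is not") + " a valid date");
        }
    }
}
```

With strict resolution, a string such as 02/30/2020 is rejected because the day does not exist, in addition to strings that do not match the pattern.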
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    // Compile the expression and keep the resulting Pattern for matching
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
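The JavaMail API also provides support for validating e-mail addresses. Note that it is a separate library and not part of the core Java SE classes. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not: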
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate
method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator
class. This class's isValid
method examines an e-mail address and determines whether it is valid or not. Our validateEmail
method shown previously is modified as follows to use EmailValidator
:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
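When the result of the check is needed for filtering rather than printing, a boolean-returning variant can be more convenient. The following is a minimal sketch using the same regular expression; the class name ZipCheck and the method isValidZip are hypothetical names chosen for illustration, and the pattern is compiled once so that it can be reused across many records:

```java
import java.util.regex.Pattern;

class ZipCheck {

    // Compiled once and reused; same expression as in the validateZip example
    private static final Pattern ZIP = Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$");

    // Returns true for five digits, optionally followed by a hyphen and four digits
    static boolean isValidZip(String zip) {
        return zip != null && ZIP.matcher(zip).matches();
    }

    public static void main(String[] args) {
        for (String zip : new String[] {"12345", "12345-6789", "123"}) {
            System.out.println(zip
                + (isValidZip(zip) ? " is a valid zip code" : " is not a valid zip code"));
        }
    }
}
```

Returning a boolean leaves the decision about what to do with invalid values, such as dropping or flagging them, to the caller.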
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s, the period, the comma, the apostrophe, and the hyphen in the character class to allow whitespace and common punctuation in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    // Letters from any language plus whitespace, period, comma, apostrophe, and hyphen
    String nameRegex = "^[\\p{L}\\s.,'-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr.
are acceptable, but the 4
in 4th
is not. Additionally, the special characters in François
and Müller
are considered valid.
Removing stop words
Text analysis sometimes requires the omission of common, non-specific words such as the, and, or but. These words are known as stop words, and there are several tools available for removing them from text. There are various ways to store a list of stop words, but for the following examples, we will assume they are contained in a file. To begin, we create a new Scanner object to read in our stop words. Then we take the text we wish to transform and store it in an ArrayList using the Arrays class's asList method. We will assume here the text has already been cleaned and normalized. It is essential to consider casing when using String class methods: and is not the same as AND or And, although all three may be stop words you wish to eliminate:
Scanner readStop = new Scanner(new File("C://stopwords.txt"));
// Split the cleaned text on spaces so that each word becomes a list element
ArrayList<String> words = new ArrayList<String>(Arrays.asList(dirtyText.split(" ")));
out.println("Original clean text: " + words.toString());
We also create a new ArrayList
to hold a list of stop words actually found in our text. This will allow us to use the ArrayList
class removeAll
method shortly. Next, we use our Scanner
to read through our file of stop words. Notice how we also call the toLowerCase
and trim
methods against each stop word. This is to ensure that our stop words match the formatting in our text. In this example, we employ the contains
method to determine whether our text contains the given stop word. If so, we add it to our foundWords
ArrayList. Once we have processed all the stop words, we call removeAll
to remove them from our text:
ArrayList<String> foundWords = new ArrayList<>();
while(readStop.hasNextLine()){
    // Normalize each stop word so that it matches the formatting of our text
    String stopWord = readStop.nextLine().toLowerCase().trim();
    if(words.contains(stopWord)){
        foundWords.add(stopWord);
    }
}
words.removeAll(foundWords);
out.println("Text without stop words: " + words.toString());
The output will depend upon the words designated as stop words. If your stop words file contains different words than used in this example, your output will differ slightly. Our output follows:
Original clean text: [call, me, ishmael, some, years, ago, never, mind, how, long, precisely, having, little, or, no, money, in, my, purse, and, nothing, particular, to, interest, me, on, shore, i, thought, i, would, sail, about, a, little, and, see, the, watery, part, of, the, world]
Text without stop words: [call, ishmael, years, ago, never, mind, how, long, precisely
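The same filtering can also be expressed with a Set and a Java 8 stream, which avoids building the intermediate list of found words. The following is a minimal sketch rather than a replacement for the approach shown; the class name StopWordFilter, the method removeStopWords, and the small in-memory stop word list are assumptions made for illustration, whereas a real list would normally be read from a file:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class StopWordFilter {

    // Returns a new list containing only the words that are not stop words
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        return words.stream()
            .filter(word -> !stopWords.contains(word))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stop word list used only for this demonstration
        Set<String> stopWords = new HashSet<>(Arrays.asList("me", "some", "or", "no"));
        List<String> words = Arrays.asList("call me ishmael some years ago".split(" "));
        System.out.println("Text without stop words: " + removeStopWords(words, stopWords));
    }
}
```

Because a HashSet lookup does not scan the whole collection, this form tends to scale better when both the text and the stop word list are large.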
There is also support outside of the standard Java libraries for removing stop words. We are going to look at one example, using LingPipe. In this example, we start by ensuring that our text is normalized in lowercase and trimmed. Then we create a new instance of the TokenizerFactory
class. We set our factory to use default English stop words and then tokenize the text. Notice that the tokenizer
method uses a char
array, so we call toCharArray
against our text. The second parameter specifies where to begin searching within the text, and the last parameter specifies where to end:
text = text.toLowerCase().trim(); TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE; fact = new EnglishStopTokenizerFactory(fact); Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length()); for(String word : tok){ out.print(word + " "); }
The output follows:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
Notice the differences between this output and that of our previous example. First of all, we did not clean the text as thoroughly and allowed special characters, such as the hyphen, to remain in the text. Secondly, the LingPipe list of stop words differs from the file we used in the previous example. Some words are removed, but LingPipe was less restrictive and allowed more words to remain in the text. The type and number of stop words you use will depend upon your particular application.
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches
method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains
method and the equals
method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); int count = 0;
Next, we call the contains
method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals
method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:
if(dirtyText.contains(toFind)){ String[] words = dirtyText.split(" "); for(String word : words){ if(word.equals(toFind)){ count++; } } out.println("Found " + toFind + " " + count + " times in the text."); }
In this example, we set toFind
to the letter I
. This produced the following output:
Found i 2 times in the text.
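Splitting on spaces and comparing with equals misses words that sit next to punctuation, such as me followed by a comma. One possible alternative, sketched here with the standard Pattern and Matcher classes, counts whole-word matches using the \\b word boundary. The class name WordCounter and the method countWord are hypothetical names used only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {

    // Counts whole-word, case-insensitive occurrences of toFind in text
    static int countWord(String text, String toFind) {
        // Pattern.quote makes the search term literal; \\b restricts matches to whole words
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(toFind) + "\\b",
            Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "on shore, I thought I would sail about a little";
        System.out.println("Found i " + countWord(text, "i") + " times in the text.");
    }
}
```

The letter i inside sail and little is not counted, because those occurrences are not bounded by non-word characters on both sides.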
We also have the option to use the Scanner class to search through an entire file. One helpful method is the findWithinHorizon method. This uses a Scanner to search the text up to a given horizon, that is, a maximum number of characters to examine. If zero is used for the second parameter, the horizon is ignored and the entire input is searched. In the example that follows, we use a horizon of 10, so only the first 10 characters are examined:
dirtyText = dirtyText.toLowerCase().trim(); toFind = toFind.toLowerCase().trim(); Scanner textLine = new Scanner(dirtyText); out.println("Found " + textLine.findWithinHorizon(toFind, 10));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as the next line is not null, which signals the end of the file:
String path = "C://MobyDick.txt";
try {
    String textLine = "";
    int line = 0; // number of lines read so far
    toFind = toFind.toLowerCase().trim();
    BufferedReader textToClean = new BufferedReader(new FileReader(path));
    while((textLine = textToClean.readLine()) != null){
        line++;
        if(textLine.toLowerCase().trim().contains(toFind)){
            out.println("Found " + toFind + " in " + textLine);
        }
    }
    textToClean.close();
} catch (IOException ex) {
    // Handle exceptions
}
We again test our data by searching for the word I
in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
Finding and replacing text
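We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression, we surround the search term with \\b word boundaries so that only whole words are replaced, and not the same letters inside other words:
text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if(text.contains(toFind)){
    // \\b limits the match to whole words; Pattern.quote treats toFind literally
    text = text.replaceAll("\\b" + Pattern.quote(toFind) + "\\b", replaceWith);
    out.println(text);
}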
To test this code, we set toFind
to the word I
and replaceWith
to Ishmael
. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace
method with several variations in the StringUtils
class. This class provides much of the same functionality as the String
class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me
with X
to demonstrate the replace
method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instance where me is followed by punctuation, such as at the end of a sentence. We will examine a better alternative using Google Guava in a moment.
The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by whitespace, such as a period, hyphen, or comma together with the space after it, with a single space:
out.println(text); text = StringUtils.replacePattern(text, "\\W\\s", " "); out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely - Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.
In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of empty spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
// CharMatcher.whitespace() supersedes the WHITESPACE constant in recent Guava releases
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data and we cannot illustrate every example in this chapter. However, for example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date can be assigned a temperature that was the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList
contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52}; double sum = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum
:
double sum = 0; tempList[0] = 0; for(double d : tempList){ sum += d; } out.printf("The average temperature is %1$,.2f", sum/12);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, a single substituted value lowered the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
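The averaging approach just described might be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: missing values are marked with Double.NaN rather than zero so that a legitimate zero reading is not mistaken for a gap, and the class name MeanImputation and the method imputeWithMean are hypothetical:

```java
class MeanImputation {

    // Returns a copy in which each NaN is replaced by the mean of the non-missing values
    static double[] imputeWithMean(double[] values) {
        double sum = 0;
        int present = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                present++;
            }
        }
        double[] result = values.clone();
        if (present == 0) {
            return result; // nothing to estimate from, so leave the data unchanged
        }
        double mean = sum / present;
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same monthly temperatures as before, with the first month marked as missing
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        System.out.printf("Imputed value for the first month: %.2f%n", filled[0]);
    }
}
```

Substituting the mean leaves the overall average of the known values unchanged, although it does understate the variability of the data.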
When it is essential to handle null data, Java's Optional
class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null
for the purposes of demonstrating these methods:
String useName = ""; String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"}; Optional<String> tempName; for(String name : nameList){ tempName = Optional.ofNullable(name); useName = tempName.orElse("DEFAULT"); out.println("Name to use = " + useName); }
We first created a variable called useName to hold the name we will actually print out. We also declared a variable of the Optional class called tempName. We then loop through our array and call the Optional class ofNullable method, which wraps each element in an Optional that is empty when the element is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:
Name to use = Amy Name to use = Bob Name to use = Sally Name to use = Sue Name to use = Don Name to use = Rick Name to use = DEFAULT Name to use = Betsy
The Optional
class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
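As one illustration of those other methods, map and filter can be chained so that blank strings are treated the same way as null values. The following is a minimal sketch; the class name NameCleaner and the method cleanName are hypothetical, and treating whitespace-only names as missing is an assumption that may not suit every dataset:

```java
import java.util.Optional;

class NameCleaner {

    // Trims the name and falls back to a default when it is null or blank
    static String cleanName(String name, String fallback) {
        return Optional.ofNullable(name)
            .map(String::trim)              // applied only when a value is present
            .filter(s -> !s.isEmpty())      // an empty result becomes an empty Optional
            .orElse(fallback);
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  Bob ", "", null, "Betsy"};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name, "DEFAULT"));
        }
    }
}
```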
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then create a new TreeSet
object to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums))); SortedSet<Integer> partNumsList; out.println("Original List: " + fullNumsList.toString() + " " + fullNumsList.last());
The subSet
method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12
and 46
:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); out.println("SubSet of List: " + partNumsList.toString() + " " + partNumsList.size());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
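Besides subSet, a TreeSet also offers headSet and tailSet, and through the NavigableSet interface an overload of subSet that lets us choose whether each bound is inclusive. The following is a minimal sketch using the same numbers; the class name RangeViews and the helper method sample are hypothetical names used for illustration:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class RangeViews {

    // Builds the sorted set used in the examples of this section
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums = sample();
        // Elements strictly less than 46
        System.out.println("headSet(46): " + nums.headSet(46));
        // Elements greater than or equal to 46
        System.out.println("tailSet(46): " + nums.tailSet(46));
        // Both bounds included
        System.out.println("subSet(12, true, 46, true): " + nums.subSet(12, true, 46, true));
    }
}
```

These methods return views backed by the original set, so changes made to the set within the range are reflected in the view.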
Another option is to use the stream
method in conjunction with the skip
method. The stream
method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList
as in the previous example, but this time we will specify how many elements to skip with the skip
method. We will also use the collect
method to create a new Set
to hold the new elements:
out.println("Original List: " + numsList.toString());
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
// Skip the five lowest values and collect the rest into a new sorted set
Set<Integer> partNumsList = fullNumsList
    .stream()
    .skip(5)
    .collect(Collectors.toCollection(TreeSet::new));
out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) {
    br.lines()
        .filter(s -> !s.equals(""))
        .forEach(s -> out.println(s));
} catch (IOException ex) {
    // Handle exceptions
}
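Lines that contain only spaces or tabs are not equal to the empty string and so pass the filter shown. A variation that also drops whitespace-only lines might look like the following sketch. To keep it self-contained, it reads from a String through a StringReader instead of a file, and the class name BlankLineFilter and the method nonBlankLines are hypothetical:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

class BlankLineFilter {

    // Keeps only lines that contain at least one non-whitespace character
    static List<String> nonBlankLines(String text) {
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            return br.lines()
                .filter(s -> !s.trim().isEmpty())
                .collect(Collectors.toList());
        } catch (IOException ex) {
            // Closing a reader over an in-memory string should not fail; rethrow if it does
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String data = "\n   \nheader\n\nrow 1\n";
        nonBlankLines(data).forEach(System.out::println);
    }
}
```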
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
We can now call the sort
method as we did previously:
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List objects and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
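A shorter route to the same descending order is the Comparator.reverseOrder method combined with the sort method that Java 8 added to the List interface. The following is a minimal sketch; the class name DescendingSort and the method sortedDescending are hypothetical, and the method sorts a copy so that the original list is left untouched:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DescendingSort {

    // Returns a new list with the values in descending natural order
    static List<Integer> sortedDescending(List<Integer> values) {
        List<Integer> copy = new ArrayList<>(values); // leave the caller's list unchanged
        copy.sort(Comparator.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println("Original Integer List: " + numsList);
        System.out.println("Descending Integer List: " + sortedDescending(numsList));
    }
}
```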
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
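The comparators in a chain do not all have to run in the same direction. The sketch below orders the dogs from oldest to youngest and breaks ties alphabetically by name. Because the Dogs class itself is not listed in the text, the nested Dogs class here is a minimal stand-in written for this illustration, and the names DogSortSketch, sample, and namesOldestFirst are likewise hypothetical:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DogSortSketch {

    // Minimal stand-in for the Dogs class described in the text
    static class Dogs {
        private final String name;
        private final int age;
        Dogs(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dogs> sample() {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        return dogs;
    }

    // Returns the names ordered by age descending, then by name ascending
    static List<String> namesOldestFirst(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        // reversed() applies only to the age comparison that precedes it
        copy.sort(Comparator.comparingInt(Dogs::getAge).reversed()
            .thenComparing(Dogs::getName));
        List<String> names = new ArrayList<>();
        for (Dogs d : copy) {
            names.add(d.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesOldestFirst(sample()));
    }
}
```

Calling reversed before thenComparing reverses only the age ordering; calling it at the end of the chain would reverse the combined ordering instead.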
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
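As noted, the same try-catch technique carries over to other numeric types. The following is a minimal sketch for floating point data; the class name DoubleCheck and the method isValidDouble are hypothetical, and the method returns a boolean rather than printing so that it can be used when filtering records:

```java
class DoubleCheck {

    // Returns true if the text can be parsed as a double
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw a NullPointerException here
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"12.5", "1e3", "Ishmael"}) {
            System.out.println(candidate
                + (isValidDouble(candidate) ? " is a valid double" : " is not a valid double"));
        }
    }
}
```

Keep in mind that parseDouble is fairly permissive: it accepts scientific notation such as 1e3 and special values such as NaN, which may or may not be acceptable for a given dataset.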
The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
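The java.time API introduced in Java 8 offers another way to perform this check. The following is a minimal sketch and not part of the validateDate method shown earlier; the class name StrictDateCheck and the method isValidDate are hypothetical. The pattern uses uuuu for the year because, under strict resolution, yyyy (year of era) would also require an era to be present:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateCheck {

    // Returns true only if the text matches the pattern and names a real calendar date
    static boolean isValidDate(String text, String pattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
            .withResolverStyle(ResolverStyle.STRICT);
        try {
            LocalDate.parse(text, formatter);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String dateFormat = "MM/dd/uuuu";
        String[] candidates = {"12/12/1982", "12/12/82", "02/30/1982", "Ishmael"};
        for (String candidate : candidates) {
            System.out.println(candidate
                + (isValidDate(candidate, dateFormat) ? " is a valid date" : " is not a valid date"));
        }
    }
}
```

With strict resolution, a correctly formatted but impossible date such as 02/30/1982 is rejected as well, so no additional comparison of the formatted result with the input is needed.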
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:

    public static String validateEmail(String email) {
        String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
            + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
            + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
        Pattern pattern = Pattern.compile(emailRegex);
        Matcher matcher = pattern.matcher(email);
        if (matcher.matches()) {
            return email + " is a valid email address";
        } else {
            return email + " is not a valid email address";
        }
    }
We make the following method calls to test our data:
    out.println(validateEmail("myemail@mail.com"));
    out.println(validateEmail("My.Email.123!@mail.net"));
    out.println(validateEmail("myEmail"));

The output follows:

    myemail@mail.com is a valid email address
    My.Email.123!@mail.net is a valid email address
    myEmail is not a valid email address
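When many addresses need to be checked, it can help to compile the pattern once and reuse it. The following is a minimal sketch under that assumption; the class EmailFilterSketch and its methods isValidEmail and keepValid are hypothetical names, and the regular expression is the same one used above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class EmailFilterSketch {
    // Compiled once and shared, rather than recompiled on every call
    private static final Pattern EMAIL = Pattern.compile(
        "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-]+@"
        + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"
        + "|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$");

    static boolean isValidEmail(String email) {
        return email != null && EMAIL.matcher(email.trim()).matches();
    }

    // Keeps only the candidates that match the pattern, preserving order
    static List<String> keepValid(List<String> candidates) {
        return candidates.stream()
                         .filter(EmailFilterSketch::isValidEmail)
                         .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "myemail@mail.com", "myEmail", "My.Email.123!@mail.net");
        System.out.println(keepValid(raw));
    }
}
```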
The JavaMail API also provides support for validating e-mail addresses. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not:

    public static String validateEmailStandard(String email) {
        try {
            InternetAddress testEmail = new InternetAddress(email);
            testEmail.validate();
            return email + " is a valid email address";
        } catch (AddressException e) {
            return email + " is not a valid email address";
        }
    }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:

    public static String validateEmailApache(String email) {
        email = email.trim();
        EmailValidator eValidator = EmailValidator.getInstance();
        if (eValidator.isValid(email)) {
            return email + " is a valid email address.";
        } else {
            return email + " is not a valid email address.";
        }
    }
Validating ZIP codes
Postal codes are generally formatted according to country or local requirements. Regular expressions are useful here because a pattern can be written for whichever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:

    public static void validateZip(String zip) {
        String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$";
        Pattern pattern = Pattern.compile(zipRegex);
        Matcher matcher = pattern.matcher(zip);
        if (matcher.matches()) {
            out.println(zip + " is a valid zip code");
        } else {
            out.println(zip + " is not a valid zip code");
        }
    }

Because validateZip prints its result and returns nothing, we call it directly to test our data:

    validateZip("12345");
    validateZip("12345-6789");
    validateZip("123");

The output follows:

    12345 is a valid zip code
    12345-6789 is a valid zip code
    123 is not a valid zip code
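To illustrate how one pattern per country might be organized, here is a small sketch. The class PostalCodeSketch, the isValidPostalCode method, and the two-letter country keys are our own assumptions. The Canadian pattern is a simplified shape check (letter, digit, letter, optional separator, digit, letter, digit) included only for illustration:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class PostalCodeSketch {
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        // Same United States pattern as in validateZip
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified Canadian shape; an assumption for illustration only
        PATTERNS.put("CA", Pattern.compile(
            "^[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]$"));
    }

    // Unknown countries are treated as invalid rather than guessed at
    static boolean isValidPostalCode(String country, String code) {
        Pattern pattern = PATTERNS.get(country);
        return pattern != null && code != null
            && pattern.matcher(code.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPostalCode("US", "12345-6789"));
        System.out.println(isValidPostalCode("CA", "K1A 0B1"));
    }
}
```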
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match letters from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s.',- in the character class to allow whitespace, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:

    public static void validateName(String name) {
        // Letters from any language, plus whitespace, period, apostrophe, comma, hyphen
        String nameRegex = "^[\\p{L}\\s.',-]+$";
        Pattern pattern = Pattern.compile(nameRegex);
        Matcher matcher = pattern.matcher(name);
        if (matcher.matches()) {
            out.println(name + " is a valid name");
        } else {
            out.println(name + " is not a valid name");
        }
    }
We make the following method calls to test our data:
    validateName("Bobby Smith, Jr.");
    validateName("Bobby Smith the 4th");
    validateName("Albrecht Müller");
    validateName("François Moreau");

The output follows:

    Bobby Smith, Jr. is a valid name
    Bobby Smith the 4th is not a valid name
    Albrecht Müller is a valid name
    François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
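One cleaning step that can matter for names is Unicode normalization. An accented letter may arrive as a base letter followed by a separate combining mark, and \\p{L} does not match the combining mark on its own. The sketch below, with the hypothetical class NameCheckSketch and a character class similar to the one used above, composes such sequences with java.text.Normalizer and collapses repeated whitespace before matching:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class NameCheckSketch {
    private static final Pattern NAME = Pattern.compile("^[\\p{L}\\s.',-]+$");

    // Compose letter + combining mark pairs and tidy the whitespace
    static String clean(String raw) {
        String composed = Normalizer.normalize(raw, Normalizer.Form.NFC);
        return composed.trim().replaceAll("\\s+", " ");
    }

    static boolean isValidName(String raw) {
        return NAME.matcher(clean(raw)).matches();
    }

    public static void main(String[] args) {
        // "u" followed by a combining diaeresis, as some sources encode "ü"
        String decomposed = "Albrecht Mu\u0308ller";
        System.out.println(NAME.matcher(decomposed).matches());
        System.out.println(isValidName(decomposed));
    }
}
```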
Finding words in text
The standard Java libraries offer support for searching through text for specific tokens. In previous examples, we have demonstrated the matches method and regular expressions, which can be useful when searching text. In this example, however, we will demonstrate a simple technique using the contains method and the equals method to locate a particular string. First, we normalize our text and the word we are searching for to ensure we can find a match. We also create an integer variable to hold the number of times the word is found:

    dirtyText = dirtyText.toLowerCase().trim();
    toFind = toFind.toLowerCase().trim();
    int count = 0;
Next, we call the contains method to determine whether the word exists in our text. If it does, we split the text into an array and then loop through, using the equals method to compare each word. If we encounter the word, we increment our counter by one. Finally, we display the output to show how many times our word was encountered:

    if (dirtyText.contains(toFind)) {
        String[] words = dirtyText.split(" ");
        for (String word : words) {
            if (word.equals(toFind)) {
                count++;
            }
        }
        out.println("Found " + toFind + " " + count + " times in the text.");
    }
In this example, we set toFind to the letter I. This produced the following output:
Found i 2 times in the text.
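Splitting on a single space means that a word followed directly by punctuation, such as me. at the end of a sentence, is not counted. A possible alternative, sketched here with the hypothetical class WordCountSketch, uses a regular expression with word boundaries so that punctuation and letter case do not matter:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCountSketch {
    // Counts whole-word, case-insensitive occurrences of word in text
    static int countWord(String text, String word) {
        Pattern pattern = Pattern.compile(
            "\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Tell me. Show me! Some time.";
        System.out.println(countWord(text, "me"));
    }
}
```

Pattern.quote is used so that a search term containing regular expression metacharacters is still treated as literal text.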
We also have the option to use the Scanner class to search through text or an entire file. One helpful method is findWithinHorizon, which searches the Scanner's input for a pattern up to a given horizon, that is, a maximum number of characters to examine. If zero is used for the second parameter, as shown next, the horizon is ignored and the entire input is searched:

    dirtyText = dirtyText.toLowerCase().trim();
    toFind = toFind.toLowerCase().trim();
    Scanner textLine = new Scanner(dirtyText);
    out.println("Found " + textLine.findWithinHorizon(toFind, 0));
This technique can be more efficient for locating a particular string, but it does make it more difficult to determine where, and how many times, the string was found.
It can also be more efficient to search an entire file using a BufferedReader. We specify the file to search and use a try-catch block to catch any IO exceptions. We create a new BufferedReader object from our path and process our file as long as readLine does not return null, which signals the end of the file:

    String path = "C://MobyDick.txt";
    try {
        String textLine = "";
        int line = 0; // running count of lines read
        toFind = toFind.toLowerCase().trim();
        BufferedReader textToClean = new BufferedReader(new FileReader(path));
        while ((textLine = textToClean.readLine()) != null) {
            line++;
            if (textLine.toLowerCase().trim().contains(toFind)) {
                out.println("Found " + toFind + " in " + textLine);
            }
        }
        textToClean.close();
    } catch (IOException ex) {
        // Handle exceptions
    }
We again test our data by searching for the word I in the first sentences of Moby Dick. The truncated output follows:
Found i in Call me Ishmael...
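The following sketch shows one way to report where each match was found. It uses try-with-resources so the reader is closed even if an exception occurs. The class LineSearchSketch and the method findLines are hypothetical names, and the method accepts any Reader so that it can be tried on a string as well as on a file:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineSearchSketch {
    // Returns "lineNumber: line" for every line containing toFind (case-insensitive)
    static List<String> findLines(Reader source, String toFind) {
        List<String> hits = new ArrayList<>();
        String needle = toFind.toLowerCase().trim();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                if (line.toLowerCase().contains(needle)) {
                    hits.add(lineNumber + ": " + line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "Call me Ishmael.\nSome years ago\nnever mind how long";
        findLines(new StringReader(sample), "i").forEach(System.out::println);
    }
}
```

To search a file instead, a FileReader for the desired path can be passed as the first argument.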
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text, our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression and would otherwise match the search text inside longer words, we surround it with \\b word-boundary anchors:

    text = text.toLowerCase().trim();
    toFind = toFind.toLowerCase().trim();
    out.println(text);
    if (text.contains(toFind)) {
        // \\b limits the match to whole words
        text = text.replaceAll("\\b" + toFind + "\\b", replaceWith);
        out.println(text);
    }

To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:
    call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.

    call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
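Since replaceAll interprets both of its arguments specially (the first as a regular expression, the second as a replacement template in which $ and \ have meaning), a defensive version quotes both. This is a minimal sketch; the class ReplaceWordSketch and the method replaceWord are our own names:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceWordSketch {
    // Replaces whole-word occurrences of toFind, treating both strings literally
    static String replaceWord(String text, String toFind, String replaceWith) {
        String regex = "\\b" + Pattern.quote(toFind) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement(replaceWith));
    }

    public static void main(String[] args) {
        System.out.println(replaceWord("i thought i would sit", "i", "Ishmael"));
    }
}
```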
Apache Commons Lang also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:
out.println(text); out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
    Call me Ishmael. Some years ago- never mind how long precisely -
    Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will ignore any instances where me is at the end of a sentence and is followed by a period. We will examine a better alternative using Google Guava in a moment.
The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, the pattern \\W\\s matches a non-word character, such as a period or hyphen, that is followed by a whitespace character, and we replace each such pair with a single space:

    out.println(text);
    text = StringUtils.replacePattern(text, "\\W\\s", " ");
    out.println(text);

This will produce the following truncated output:

    Call me Ishmael. Some years ago- never mind how long precisely -
    Call me Ishmael Some years ago never mind how long precisely
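A similar cleanup can be sketched with the core String class alone. Here we assume that every character that is not a letter, digit, or whitespace should be dropped, and that runs of whitespace should then be collapsed; the class PunctuationSketch and the method stripPunctuation are hypothetical:

```java
class PunctuationSketch {
    static String stripPunctuation(String text) {
        return text.replaceAll("[^\\p{L}\\p{N}\\s]", "") // drop punctuation
                   .replaceAll("\\s+", " ")              // collapse whitespace
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation(
            "Call me Ishmael. Some years ago- never mind how long precisely -"));
    }
}
```

Note that this approach also removes apostrophes and hyphens inside words, which may or may not be desirable for a given dataset.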
Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.
In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:

    text = text.replace("me", " ");
    out.println("With double spaces: " + text);
    // whitespace() is the factory method that supersedes the older WHITESPACE constant
    String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
    out.println("With double spaces removed: " + spaced);

Our output is truncated as follows:

    With double spaces: Call   Ishmael. So  years ago- ...
    With double spaces removed: Call Ishmael. So years ago- ...
Data imputation
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is being conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. As one example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:

    double[] tempList = {50, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
    double sum = 0;
    for (double d : tempList) {
        sum += d;
    }
    out.printf("The average temperature is %1$,.2f", sum / tempList.length);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we recalculate our sum:

    sum = 0; // reset the running total from the previous calculation
    tempList[0] = 0;
    for (double d : tempList) {
        sum += d;
    }
    out.printf("The average temperature is %1$,.2f", sum / tempList.length);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that although only one of the twelve values was changed, the average dropped by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
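The averaging approach just described can be sketched as follows. We assume here that missing readings are marked with Double.NaN rather than zero, so that a genuine zero is never mistaken for missing data; the class MeanImputationSketch and the method imputeWithMean are our own illustrative names:

```java
import java.util.Arrays;

class MeanImputationSketch {
    // Replaces every NaN with the mean of the values that are present
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                            .filter(v -> !Double.isNaN(v))
                            .average()
                            .orElse(Double.NaN); // nothing to average
        return Arrays.stream(values)
                     .map(v -> Double.isNaN(v) ? mean : v)
                     .toArray();
    }

    public static void main(String[] args) {
        double[] tempList =
            {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        System.out.println(Arrays.toString(imputeWithMean(tempList)));
    }
}
```

Filling gaps with the mean leaves the overall average unchanged, although it does reduce the apparent variation in the data.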
When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:

    String useName = "";
    String[] nameList =
        {"Amy", "Bob", "Sally", "Sue", "Don", "Rick", null, "Betsy"};
    Optional<String> tempName;
    for (String name : nameList) {
        tempName = Optional.ofNullable(name);
        useName = tempName.orElse("DEFAULT");
        out.println("Name to use = " + useName);
    }
We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName. We will use this to handle array values that may be null. We then loop through our array and call the ofNullable method of the Optional class. This method wraps the value in an Optional, which is empty when the value is null. On the next line, we call the orElse method to either assign a value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:

    Name to use = Amy
    Name to use = Bob
    Name to use = Sally
    Name to use = Sue
    Name to use = Don
    Name to use = Rick
    Name to use = DEFAULT
    Name to use = Betsy
The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
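As a brief illustration of some of those other methods, the sketch below chains map and filter before orElse so that blank strings are treated the same way as null. The class OptionalNameSketch and the method cleanName are hypothetical names:

```java
import java.util.Optional;

class OptionalNameSketch {
    // Trims the name and falls back to DEFAULT for null or blank input
    static String cleanName(String raw) {
        return Optional.ofNullable(raw)
                       .map(String::trim)
                       .filter(s -> !s.isEmpty())
                       .orElse("DEFAULT");
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", "  Bob ", null, "   "};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name));
        }
    }
}
```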
Subsetting data
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list, followed by its last (largest) element:

    Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44};
    TreeSet<Integer> fullNumsList =
        new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums)));
    SortedSet<Integer> partNumsList;
    out.println("Original List: " + fullNumsList.toString()
        + " " + fullNumsList.last());
The subSet method takes two parameters, which specify the range of values we want to retrieve. The first parameter is inclusive while the second is exclusive. In the example that follows, we retrieve the subset of all numbers from the first number in our set, 12, up to but not including 46. We also print the size of the subset:

    partNumsList = fullNumsList.subSet(fullNumsList.first(), 46);
    out.println("SubSet of List: " + partNumsList.toString()
        + " " + partNumsList.size());
Our output follows:
    Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
    SubSet of List: [12, 14, 34, 44] 4
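TreeSet also implements the NavigableSet interface, which offers related range views. The following sketch, using the hypothetical class RangeViewSketch and the same numbers, shows headSet, tailSet, and the four-argument form of subSet that lets us choose whether each bound is inclusive:

```java
import java.util.Arrays;
import java.util.TreeSet;

class RangeViewSketch {
    static TreeSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        TreeSet<Integer> nums = sample();
        // Both bounds inclusive
        System.out.println(nums.subSet(12, true, 46, true));
        // Everything strictly below 46
        System.out.println(nums.headSet(46));
        // Everything from 52 upward, 52 included
        System.out.println(nums.tailSet(52));
    }
}
```

These methods return views backed by the original set, so a copy should be made if the subset needs to change independently.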
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, held in a list named numsList, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements:

    // Assumes: import static java.util.stream.Collectors.toCollection;
    out.println("Original List: " + numsList.toString());
    Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
    Set<Integer> partNumsList = fullNumsList
        .stream()
        .skip(5)
        .collect(toCollection(TreeSet::new));
    out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because the elements are held in a TreeSet, which keeps them sorted, we are actually omitting the five lowest numbers:

    Original List: [12, 46, 52, 34, 87, 123, 14, 44]
    SubSet of List: [52, 87, 123]
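Combining skip with limit selects a slice from the middle of the sorted data. The sketch below assumes a hypothetical helper named slice in a class we call SliceSketch:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class SliceSketch {
    // Skips the first 'skip' elements, then keeps at most 'size' elements
    static Set<Integer> slice(Set<Integer> sorted, long skip, long size) {
        return sorted.stream()
                     .skip(skip)
                     .limit(size)
                     .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        Set<Integer> nums =
            new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
        System.out.println(slice(nums, 2, 3));
    }
}
```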
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:

    try (BufferedReader br = new BufferedReader(
            new FileReader("C:\\text.txt"))) {
        br.lines()
          .filter(s -> !s.equals(""))
          .forEach(s -> out.println(s));
    } catch (IOException ex) {
        // Handle exceptions
    }
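The filter above removes only completely empty lines. A variation, sketched here with the hypothetical class BlankLineSketch, also drops lines containing nothing but whitespace and skips a given number of header lines. We assume the header lines are the first physical lines of the input:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class BlankLineSketch {
    // Skips headerLines lines, then keeps every line with visible content
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                     .skip(headerLines)
                     .filter(s -> !s.trim().isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sample = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        BufferedReader reader = new BufferedReader(new StringReader(sample));
        dataLines(reader, 1).forEach(System.out::println);
    }
}
```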
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines the relative order of the two integers:

    Comparator<Integer> compareInts =
        (Integer first, Integer second) -> Integer.compare(first, second);
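We can now pass compareInts to the sort method of the Collections class. Here numsList is the list of integers that is initialized with the Collections examples later in this section:

    Collections.sort(numsList, compareInts);
    out.println("Sorted integers using Lambda: " + numsList.toString());

Our output follows:

    Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]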
We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:

    Comparator<String> compareWords =
        (String first, String second) -> first.compareTo(second);
    Collections.sort(wordsList, compareWords);
    out.println("Sorted words using Lambda: " + wordsList.toString());

When this code is executed, we should see the following output:

    Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
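The compareTo method orders strings by character value, so uppercase letters sort before lowercase ones. When text has mixed capitalization, the predefined String.CASE_INSENSITIVE_ORDER comparator can be used instead. The class CaseSortSketch and the method sortedCopy in this sketch are hypothetical names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CaseSortSketch {
    // Sorts a copy so the caller's list is left untouched
    static List<String> sortedCopy(List<String> words, boolean ignoreCase) {
        List<String> copy = new ArrayList<>(words);
        Comparator<String> order = ignoreCase
            ? String.CASE_INSENSITIVE_ORDER
            : Comparator.<String>naturalOrder();
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "Dog", "boat");
        System.out.println(sortedCopy(words, false));
        System.out.println(sortedCopy(words, true));
    }
}
```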
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both lists and are initialized as follows:

    List<String> wordsList =
        Stream.of("cat", "dog", "house", "boat", "road", "zoo")
              .collect(Collectors.toList());
    List<Integer> numsList =
        Stream.of(12, 46, 52, 34, 87, 123, 14, 44)
              .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:

    out.println("Original Word List: " + wordsList.toString());
    Collections.sort(wordsList);
    out.println("Ascending Word List: " + wordsList.toString());
    out.println("Original Integer List: " + numsList.toString());
    Collections.sort(numsList);
    out.println("Ascending Integer List: " + numsList.toString());

The output follows:

    Original Word List: [cat, dog, house, boat, road, zoo]
    Ascending Word List: [boat, cat, dog, house, road, zoo]
    Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
    Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:

    out.println("Original Integer List: " + numsList.toString());
    Collections.reverse(numsList);
    out.println("Reversed Integer List: " + numsList.toString());

The output displays our new numsList:

    Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
    Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:

    out.println("Original Integer List: " + numsList.toString());
    Comparator<Integer> basicOrder = Integer::compare;
    Comparator<Integer> descendOrder = basicOrder.reversed();
    Collections.sort(numsList, descendOrder);
    out.println("Descending Integer List: " + numsList.toString());

After we execute this code, we will see the following output:

    Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
    Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:

    ArrayList<Dogs> dogs = new ArrayList<Dogs>();
    dogs.add(new Dogs("Zoey", 8));
    dogs.add(new Dogs("Roxie", 10));
    dogs.add(new Dogs("Kylie", 7));
    dogs.add(new Dogs("Shorty", 14));
    dogs.add(new Dogs("Ginger", 7));
    dogs.add(new Dogs("Penny", 7));
    out.println("Name " + " Age");
    for (Dogs d : dogs) {
        out.println(d.getName() + " " + d.getAge());
    }

Our output should resemble:

    Name  Age
    Zoey 8
    Roxie 10
    Kylie 7
    Shorty 14
    Ginger 7
    Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:

    dogs.sort(Comparator.comparing(Dogs::getName)
                        .thenComparing(Dogs::getAge));
    out.println("Name " + " Age");
    for (Dogs d : dogs) {
        out.println(d.getName() + " " + d.getAge());
    }

Our output follows:

    Name  Age
    Ginger 7
    Kylie 7
    Penny 7
    Roxie 10
    Shorty 14
    Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
    dogs.sort(Comparator.comparing(Dogs::getAge)
                        .thenComparing(Dogs::getName));
    out.println("Name " + " Age");
    for (Dogs d : dogs) {
        out.println(d.getName() + " " + d.getAge());
    }

And our output is:

    Name  Age
    Ginger 7
    Kylie 7
    Penny 7
    Zoey 8
    Roxie 10
    Shorty 14
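The same chaining style can mix ascending and descending keys. In the sketch below we sort from oldest to youngest and break ties alphabetically by name. The nested Dog class is a hypothetical stand-in for the class assumed above, and DogSortSketch and namesOldestFirst are our own names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class DogSortSketch {
    // Minimal stand-in with the two properties and their accessors
    static class Dog {
        private final String name;
        private final int age;
        Dog(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    static List<Dog> sample() {
        return Arrays.asList(new Dog("Zoey", 8), new Dog("Roxie", 10),
            new Dog("Kylie", 7), new Dog("Shorty", 14),
            new Dog("Ginger", 7), new Dog("Penny", 7));
    }

    // Age descending first, then name ascending for dogs of equal age
    static List<String> namesOldestFirst(List<Dog> dogs) {
        List<Dog> copy = new ArrayList<>(dogs);
        copy.sort(Comparator.comparingInt(Dog::getAge).reversed()
                            .thenComparing(Dog::getName));
        return copy.stream().map(Dog::getName).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesOldestFirst(sample()));
    }
}
```

Because reversed is applied before thenComparing, only the age comparison is reversed; the name comparison remains ascending.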
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:

    public static void validateInt(String toValidate) {
        try {
            int validInt = Integer.parseInt(toValidate);
            out.println(validInt + " is a valid integer");
        } catch (NumberFormatException e) {
            out.println(toValidate + " is not a valid integer");
        }
    }
We will use the following method calls to test our method:
    validateInt("1234");
    validateInt("Ishmael");

The output follows:

    1234 is a valid integer
    Ishmael is not a valid integer
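When the parsed value is needed later, it can be convenient to return it rather than print it. The following sketch uses OptionalInt for that purpose and adds a simple range check built on top of it. The class IntCheckSketch and the methods parseInt and isIntInRange are hypothetical names:

```java
import java.util.OptionalInt;

class IntCheckSketch {
    // Returns the parsed value, or an empty OptionalInt if the text is not an int
    static OptionalInt parseInt(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(text.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    // True when the text is an integer between min and max, both inclusive
    static boolean isIntInRange(String text, int min, int max) {
        OptionalInt value = parseInt(text);
        return value.isPresent()
            && value.getAsInt() >= min
            && value.getAsInt() <= max;
    }

    public static void main(String[] args) {
        System.out.println(parseInt("1234"));
        System.out.println(parseInt("Ishmael"));
        System.out.println(isIntInRange("50", 0, 100));
    }
}
```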
The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:

    public static String validateInt(String text) {
        IntegerValidator intValidator = IntegerValidator.getInstance();
        if (intValidator.isValid(text)) {
            return text + " is a valid integer";
        } else {
            return text + " is not a valid integer";
        }
    }
Because this version returns its message rather than printing it, we print the result of each call to test our method:

    out.println(validateInt("1234"));
    out.println(validateInt("Ishmael"));

The output follows:

    1234 is a valid integer
    Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) { String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + "[0-9]{1,3}\\])|(([a-zAZ\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; Pattern.compile(emailRegex); Matcher matcher = pattern.matcher(email); if(matcher.matches()){ return email + " is a valid email address"; }else{ return email + " is not a valid email address"; } }
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress
class to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email) {
    email = email.trim();
    EmailValidator eValidator = EmailValidator.getInstance();
    if (eValidator.isValid(email)) {
        return email + " is a valid email address.";
    } else {
        return email + " is not a valid email address.";
    }
}
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip) {
    String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$";
    Pattern pattern = Pattern.compile(zipRegex);
    Matcher matcher = pattern.matcher(zip);
    if (matcher.matches()) {
        out.println(zip + " is a valid zip code");
    } else {
        out.println(zip + " is not a valid zip code");
    }
}
We make the following method calls to test our data. Because validateZip prints its own result and returns nothing, we call it directly:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also add \\s and the characters . ' , - to the character class to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name) {
    // The period is listed explicitly so that suffixes such as "Jr." match;
    // the hyphen comes last so that it is read as a literal character
    String nameRegex = "^[\\p{L}\\s.',-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if (matcher.matches()) {
        out.println(name + " is a valid name");
    } else {
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr.");
validateName("Bobby Smith the 4th");
validateName("Albrecht Müller");
validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
Finding and replacing text
We often not only want to find text but also replace it with something else. We begin our next example much like we did the previous examples, by specifying our text and our text to locate, and invoking the contains method. If we find the text, we call the replaceAll method to modify our string. Because replaceAll treats its first argument as a regular expression, we quote the search text and wrap it in \\b word boundaries so that only whole words are replaced:
text = text.toLowerCase().trim();
toFind = toFind.toLowerCase().trim();
out.println(text);
if (text.contains(toFind)) {
    // \\b limits the match to whole words; Pattern.quote escapes any
    // regular expression metacharacters in the search text
    text = text.replaceAll("\\b" + Pattern.quote(toFind) + "\\b", replaceWith);
    out.println(text);
}
To test this code, we set toFind to the word I and replaceWith to Ishmael. Our output follows:
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, i thought i would sail about a little and see the watery part of the world.
call me ishmael. some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, Ishmael thought Ishmael would sail about a little and see the watery part of the world.
Apache Commons also provides a replace method with several variations in the StringUtils class. This class provides much of the same functionality as the String class, but with more flexibility and options. In the following example, we use our string from Moby Dick and replace all instances of the word me with X to demonstrate the replace method:
out.println(text);
out.println(StringUtils.replace(text, "me", "X"));
The truncated output follows:
Call me Ishmael. Some years ago- never mind how long precisely -
Call X Ishmael. SoX years ago- never mind how long precisely -
Notice how every instance of me has been replaced, even those instances contained within other words, such as some. This can be avoided by adding spaces around me, although this will miss any instance where me is followed by punctuation, as in "me." at the end of a sentence. We will look at another approach using Google Guava in a moment.
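One way to avoid both problems with plain Java is to match on word boundaries instead of spaces. The following is a minimal sketch, not part of the Apache Commons API: the class name WholeWordReplace and the helper replaceWholeWord are hypothetical names chosen for illustration, and the sample sentence is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordReplace {
    // Replaces only whole-word occurrences of 'word' in 'text'.
    static String replaceWholeWord(String text, String word, String replacement) {
        // \b matches a word boundary; Pattern.quote makes 'word' a literal
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        // quoteReplacement keeps any '$' or '\' in the replacement literal
        return pattern.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Call me Ishmael. Some years ago, he told me.";
        // Prints: Call X Ishmael. Some years ago, he told X.
        System.out.println(replaceWholeWord(text, "me", "X"));
    }
}
```

The word Some is left alone because there is no word boundary inside it, while the me before the final period is still replaced.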
The StringUtils class also provides a replacePattern method that allows you to search for and replace text based upon a regular expression. In the following example, we replace each non-word character that is followed by a whitespace character, such as a period or hyphen at the end of a word, with a single space:
out.println(text);
text = StringUtils.replacePattern(text, "\\W\\s", " ");
out.println(text);
This will produce the following truncated output:
Call me Ishmael. Some years ago- never mind how long precisely -
Call me Ishmael Some years ago never mind how long precisely
Google Guava provides additional support for matching and modifying text data using the CharMatcher class. CharMatcher not only allows you to find data matching a particular char pattern, but also provides options as to how to handle the data. This includes allowing you to retain the data, replace the data, and trim whitespace from within a particular string.
In this example, we are going to use the replace method to simply replace all instances of the word me with a single space. This will produce a series of extra spaces within our text. We will then collapse the extra whitespace using the trimAndCollapseFrom method and print our string again:
text = text.replace("me", " ");
out.println("With double spaces: " + text);
// CharMatcher.whitespace() is the factory method that supersedes
// the older CharMatcher.WHITESPACE constant
String spaced = CharMatcher.whitespace().trimAndCollapseFrom(text, ' ');
out.println("With double spaces removed: " + spaced);
Our output is truncated as follows:
With double spaces: Call   Ishmael. So  years ago- ...
With double spaces removed: Call Ishmael. So years ago- ...
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44};
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(Arrays.asList(nums));
SortedSet<Integer> partNumsList;
out.println("Original List: " + fullNumsList.toString());
The subSet method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is inclusive, while the second is exclusive. In the example that follows, we retrieve all numbers from the first (lowest) number in our set, 12, up to but not including 46:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46);
out.println("SubSet of List: " + partNumsList.toString());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123]
SubSet of List: [12, 14, 34, 44]
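Because the second argument of subSet is exclusive, 46 itself is left out of the result. When the upper bound should be included, TreeSet also implements the NavigableSet interface, whose four-argument subSet method lets us state for each bound whether it is inclusive. The sketch below uses the same numbers; the class name RangeViews and the sample method are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

final class RangeViews {
    // The same numbers as in the example above, held in a sorted set
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums = sample();
        // Both bounds inclusive, so 46 is part of the result
        System.out.println(nums.subSet(12, true, 46, true)); // [12, 14, 34, 44, 46]
        // Everything strictly below 46
        System.out.println(nums.headSet(46));                // [12, 14, 34, 44]
        // Everything from 46 upward; tailSet includes its bound
        System.out.println(nums.tailSet(46));                // [46, 52, 87, 123]
    }
}
```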
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use the same numbers as in the previous example, this time stored in a List named numsList, and specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the remaining elements:
out.println("Original List: " + numsList.toString());
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
Set<Integer> partNumsList = fullNumsList
    .stream()
    .skip(5)
    .collect(Collectors.toCollection(TreeSet::new));
out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because the set is a TreeSet, which keeps its elements in sorted order, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
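The skip method can be combined with limit to take a slice from the middle of the sorted data rather than only dropping the lowest values. This is a small sketch using the same numbers; the class name StreamSlice and the slice helper are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class StreamSlice {
    // Sorts the values, skips 'offset' of them, and keeps the next 'count'.
    static List<Integer> slice(List<Integer> values, long offset, long count) {
        return new TreeSet<>(values).stream()
                .skip(offset)
                .limit(count)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        // Skips 12 and 14, then keeps the next three values: [34, 44, 46]
        System.out.println(slice(numsList, 2, 3));
    }
}
```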
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) {
    br
        .lines()
        .filter(s -> !s.equals(""))
        .forEach(s -> out.println(s));
} catch (IOException ex) {
    // Handle exceptions
}
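The filter above removes only lines that are completely empty, and it does not remove header lines. The following sketch extends the same idea under a few assumptions: lines containing only spaces or tabs are also treated as blank, the header occupies a known number of non-blank lines, and the data is read from a String through a StringReader so that the example does not depend on a file. The names LineCleaner and dataLines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

final class LineCleaner {
    // Returns the non-blank lines of 'raw' that follow the header lines.
    static List<String> dataLines(String raw, int headerLines) {
        try (BufferedReader br = new BufferedReader(new StringReader(raw))) {
            return br.lines()
                    .filter(s -> !s.trim().isEmpty()) // drop empty and whitespace-only lines
                    .skip(headerLines)                // then drop the header line(s)
                    .collect(Collectors.toList());
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String raw = "\n  \nname,age\nAmy,30\n\nBob,25\n";
        dataLines(raw, 1).forEach(System.out::println);
    }
}
```

Filtering before skipping means that blank lines above the header are not counted as header lines.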
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:
Comparator<Integer> compareInts =
    (Integer first, Integer second) -> Integer.compare(first, second);
We can now pass compareInts to the sort method of the Collections class. Here, numsList and wordsList are the lists of integers and words that are initialized in the next example:
Collections.sort(numsList, compareInts);
out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:
Comparator<String> compareWords =
    (String first, String second) -> first.compareTo(second);
Collections.sort(wordsList, compareWords);
out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List instances and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo")
    .collect(Collectors.toList());
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44)
    .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString());
Collections.sort(wordsList);
out.println("Ascending Word List: " + wordsList.toString());
out.println("Original Integer List: " + numsList.toString());
Collections.sort(numsList);
out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString());
Collections.reverse(numsList);
out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is used:
out.println("Original Integer List: " + numsList.toString());
Comparator<Integer> basicOrder = Integer::compare;
Comparator<Integer> descendOrder = basicOrder.reversed();
Collections.sort(numsList, descendOrder);
out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>();
dogs.add(new Dogs("Zoey", 8));
dogs.add(new Dogs("Roxie", 10));
dogs.add(new Dogs("Kylie", 7));
dogs.add(new Dogs("Shorty", 14));
dogs.add(new Dogs("Ginger", 7));
dogs.add(new Dogs("Penny", 7));
out.println("Name " + " Age");
for (Dogs d : dogs) {
    out.println(d.getName() + " " + d.getAge());
}
Our output should resemble:
Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge));
out.println("Name " + " Age");
for (Dogs d : dogs) {
    out.println(d.getName() + " " + d.getAge());
}
Our output follows:
Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName));
out.println("Name " + " Age");
for (Dogs d : dogs) {
    out.println(d.getName() + " " + d.getAge());
}
And our output is:
Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
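The two techniques shown in this section, reversed and thenComparing, can also be combined in a single chain. The sketch below sorts the oldest dogs first and breaks ties by name. Since the chapter's Dogs class is not shown, the nested Dog class here is a hypothetical stand-in with the same two properties, and DogSortSketch, oldestFirst, and sample are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class DogSortSketch {
    // Hypothetical stand-in for the chapter's Dogs class
    static final class Dog {
        private final String name;
        private final int age;
        Dog(String name, int age) { this.name = name; this.age = age; }
        String getName() { return name; }
        int getAge() { return age; }
    }

    // Returns the names ordered by age (descending), then by name (ascending).
    static List<String> oldestFirst(List<Dog> dogs) {
        List<Dog> sorted = new ArrayList<>(dogs); // leave the caller's list untouched
        sorted.sort(Comparator.comparingInt(Dog::getAge).reversed()
                .thenComparing(Dog::getName));
        return sorted.stream().map(Dog::getName).collect(Collectors.toList());
    }

    static List<Dog> sample() {
        return Arrays.asList(new Dog("Zoey", 8), new Dog("Roxie", 10),
                new Dog("Kylie", 7), new Dog("Shorty", 14),
                new Dog("Ginger", 7), new Dog("Penny", 7));
    }

    public static void main(String[] args) {
        // Prints: [Shorty, Roxie, Zoey, Ginger, Kylie, Penny]
        System.out.println(oldestFirst(sample()));
    }
}
```

Note that reversed applies only to the comparator built so far, so the names of dogs with the same age are still listed in ascending order.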
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate) {
    try {
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    } catch (NumberFormatException e) {
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234");
validateInt("Ishmael");
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
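As a sketch of how the same technique might be adapted for floating point data, the following method uses Double.parseDouble in place of Integer.parseInt. The class name DoubleCheck and the method isValidDouble are hypothetical, and the method returns a boolean rather than printing, which is a design choice made here so that the result can be reused by other code.

```java
final class DoubleCheck {
    // Returns true when the text can be parsed as a double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            return false; // parseDouble would throw NullPointerException
        }
        try {
            Double.parseDouble(toValidate);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.5"));    // true
        System.out.println(isValidDouble("1e3"));     // true, scientific notation
        System.out.println(isValidDouble("Ishmael")); // false
    }
}
```

Be aware that Double.parseDouble is more permissive than it may first appear. It also accepts text such as NaN and Infinity, so an additional check may be needed if those values are not acceptable in the dataset.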
Apache Commons Validator contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text) {
    IntegerValidator intValidator = IntegerValidator.getInstance();
    if (intValidator.isValid(text)) {
        return text + " is a valid integer";
    } else {
        return text + " is not a valid integer";
    }
}
We again use the following method calls to test our method. Because this version returns its message instead of printing it, we print the returned strings:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports the date as invalid. If, however, the String can be parsed to a date, we format the parsed date back into text and compare it with the original input to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat) {
    try {
        SimpleDateFormat format = new SimpleDateFormat(dateFormat);
        Date test = format.parse(theDate);
        // Round-trip check: the parsed date must format back to the input
        if (format.format(test).equals(theDate)) {
            return theDate + " is a valid date";
        } else {
            return theDate + " is not a valid date";
        }
    } catch (ParseException e) {
        return theDate + " is not a valid date";
    }
}
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy";
out.println(validateDate("12/12/1982", dateFormat));
out.println(validateDate("12/12/82", dateFormat));
out.println(validateDate("Ishmael", dateFormat));
The output follows:
12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
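One way to be less restrictive without giving up validation is to accept a short list of known formats and try each in turn. The following sketch uses the java.time classes introduced in Java 8 rather than SimpleDateFormat, which is a substitution made for this illustration. The class name FlexibleDates, the parse method, and the choice of the two accepted layouts are assumptions.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class FlexibleDates {
    // Accepted layouts; 'uuuu' is the year field used with STRICT resolution
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
            strict("MM/dd/uuuu"),
            strict("uuuu-MM-dd"));

    private static DateTimeFormatter strict(String pattern) {
        // STRICT rejects impossible dates such as February 30
        return DateTimeFormatter.ofPattern(pattern)
                .withResolverStyle(ResolverStyle.STRICT);
    }

    // Returns the parsed date, or an empty Optional if no layout matches.
    static Optional<LocalDate> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // Not this layout; try the next one
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("12/12/1982").isPresent()); // true
        System.out.println(parse("1982-12-12").isPresent()); // true
        System.out.println(parse("12/12/82").isPresent());   // false
    }
}
```

Returning the parsed LocalDate rather than a message also normalizes the data, since both accepted layouts end up as the same value.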
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-]+@"
        + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"
        + "|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    // Keep the compiled Pattern so that it can create the Matcher
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if (matcher.matches()) {
        return email + " is a valid email address";
    } else {
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com"));
out.println(validateEmail("My.Email.123!@mail.net"));
out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address
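When cleaning a whole dataset, it is often more useful to separate valid from invalid entries than to print a message for each one. The sketch below compiles the pattern once and partitions a list of candidates. The pattern is deliberately much simpler than the one above and is an assumption made for this illustration only, as are the names EmailPartition and partition.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class EmailPartition {
    // Simplified pattern (an assumption): any local part without '@' or
    // whitespace, then a domain with at least one period
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("^[^@\\s]+@([A-Za-z0-9-]+\\.)+[A-Za-z]{2,}$");

    // Key true holds the entries that match; key false holds the rest.
    static Map<Boolean, List<String>> partition(List<String> candidates) {
        return candidates.stream()
                .collect(Collectors.partitioningBy(
                        s -> SIMPLE_EMAIL.matcher(s).matches()));
    }

    public static void main(String[] args) {
        Map<Boolean, List<String>> parts = partition(Arrays.asList(
                "myemail@mail.com", "My.Email.123!@mail.net", "myEmail"));
        System.out.println("Valid: " + parts.get(true));
        System.out.println("Invalid: " + parts.get(false));
    }
}
```

Compiling the Pattern once as a constant avoids recompiling the expression for every value that is tested.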
The JavaMail API also provides a way to validate e-mail addresses. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email) {
    try {
        InternetAddress testEmail = new InternetAddress(email);
        testEmail.validate();
        return email + " is a valid email address";
    } catch (AddressException e) {
        return email + " is not a valid email address";
    }
}
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email) {
    email = email.trim();
    EmailValidator eValidator = EmailValidator.getInstance();
    if (eValidator.isValid(email)) {
        return email + " is a valid email address.";
    } else {
        return email + " is not a valid email address.";
    }
}
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip) {
    String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$";
    Pattern pattern = Pattern.compile(zipRegex);
    Matcher matcher = pattern.matcher(zip);
    if (matcher.matches()) {
        out.println(zip + " is a valid zip code");
    } else {
        out.println(zip + " is not a valid zip code");
    }
}
We make the following method calls to test our data. Because validateZip prints its own result and returns nothing, we call it directly:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
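To illustrate how regular expressions can accommodate postal codes from more than one country, the following sketch keeps one pattern per country code in a map. The US pattern is the one used above. The Canadian pattern is a simplified assumption that checks only the letter-digit layout (for example, K1A 0B1) and not which letters are actually in use. The names PostalCodes and isValid are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class PostalCodes {
    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Simplified layout check only: letter, digit, letter, optional
        // space, digit, letter, digit
        PATTERNS.put("CA", Pattern.compile(
                "^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"));
    }

    static boolean isValid(String countryCode, String postalCode) {
        Pattern pattern = PATTERNS.get(countryCode);
        // Countries without a registered pattern are rejected, not guessed at
        return pattern != null && pattern.matcher(postalCode).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("US", "12345-6789")); // true
        System.out.println(isValid("CA", "K1A 0B1"));    // true
        System.out.println(isValid("US", "123"));        // false
    }
}
```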
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also add \\s and the characters . ' , - to the character class to allow spaces, periods, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name) {
    // The period is listed explicitly so that suffixes such as "Jr." match;
    // the hyphen comes last so that it is read as a literal character
    String nameRegex = "^[\\p{L}\\s.',-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if (matcher.matches()) {
        out.println(name + " is a valid name");
    } else {
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr.");
validateName("Bobby Smith the 4th");
validateName("Albrecht Müller");
validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
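As a sketch of the string cleaning mentioned earlier, the following example trims a name and collapses runs of whitespace before matching. Once this has been done, the pattern needs to allow only a single space character rather than every kind of whitespace. The names NameCleaner, normalize, and isValidName are hypothetical, and the set of permitted punctuation is an assumption.

```java
import java.util.regex.Pattern;

final class NameCleaner {
    // Letters from any language plus space, period, apostrophe, comma, hyphen
    private static final Pattern NAME = Pattern.compile("^[\\p{L} .',-]+$");

    // Trims the ends and collapses each run of whitespace to one space.
    static String normalize(String raw) {
        return raw.trim().replaceAll("\\s+", " ");
    }

    static boolean isValidName(String raw) {
        return NAME.matcher(normalize(raw)).matches();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Bobby \t Smith,  Jr. "));  // Bobby Smith, Jr.
        System.out.println(isValidName("  Bobby \t Smith,  Jr. ")); // true
        System.out.println(isValidName("Bobby Smith the 4th"));     // false
    }
}
```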
Data imputation
Data imputation refers to the process of identifying and replacing missing data in a given dataset. In almost any substantial case of data analysis, missing data will be an issue, and it needs to be addressed before data can be properly analysed. Trying to process data that is missing information is a lot like trying to understand a conversation where every once in a while a word is dropped. Sometimes we can understand what is intended. In other situations, we may be completely lost as to what is trying to be conveyed.
Among statistical analysts, there exist differences of opinion as to how missing data should be handled, but the most common approaches involve replacing missing data with a reasonable estimate or with an empty or null value.
To prevent skewing and misalignment of data, many statisticians advocate for replacing missing data with values representative of the average or expected value for that dataset. The methodology for determining a representative value and assigning it to a location within the data will vary depending upon the data, and we cannot illustrate every case in this chapter. As one example, if a dataset contained a list of temperatures across a range of dates, and one date was missing a temperature, that date could be assigned the average of the temperatures within the dataset.
We will examine a rather trivial example to demonstrate the issues surrounding data imputation. Let's assume the variable tempList contains average temperature data for each month of one year. Then we perform a simple calculation of the average and print out our results:
double[] tempList = {50,56,65,70,74,80,82,90,83,78,64,52};
double sum = 0;
for (double d : tempList) {
    sum += d;
}
out.printf("The average temperature is %1$,.2f", sum / tempList.length);
Notice that for the numbers used in this execution, the output is as follows:
The average temperature is 70.33
Next we will mimic missing data by changing the first element of our array to zero before we calculate our sum:
double sum = 0;
tempList[0] = 0;
for (double d : tempList) {
    sum += d;
}
out.printf("The average temperature is %1$,.2f", sum / tempList.length);
This will change the average temperature displayed in our output:
The average temperature is 66.17
Notice that while this change may seem rather minor, a single substituted value lowers the average by more than four degrees. Depending upon the variation within a given dataset and how far the average is from zero or some other substituted value, the results of a statistical analysis may be significantly skewed. This does not mean zero should never be used as a substitute for null or otherwise invalid values, but other alternatives should be considered.
One alternative approach can be to calculate the average of the values in the array, excluding zeros or nulls, and then substitute the average in each position with missing data. It is important to consider the type of data and purpose of data analysis when making these decisions. For example, in the preceding example, will zero always be an invalid average temperature? Perhaps not if the temperatures were averages for Antarctica.
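The alternative just described can be sketched as follows. To sidestep the question of whether zero is a legitimate value, this sketch marks missing entries with Double.NaN instead of zero, which is an assumption made for this illustration. The names MeanImputation and imputeWithMean are hypothetical.

```java
import java.util.Arrays;

final class MeanImputation {
    // Returns a copy in which each NaN entry is replaced by the mean
    // of the known (non-NaN) values.
    static double[] imputeWithMean(double[] values) {
        double mean = Arrays.stream(values)
                .filter(v -> !Double.isNaN(v))
                .average()
                .orElse(Double.NaN); // stays NaN if no value is known
        return Arrays.stream(values)
                .map(v -> Double.isNaN(v) ? mean : v)
                .toArray();
    }

    public static void main(String[] args) {
        double[] tempList = {Double.NaN, 56, 65, 70, 74, 80, 82, 90, 83, 78, 64, 52};
        double[] filled = imputeWithMean(tempList);
        // The missing month receives the mean of the other eleven, about 72.18
        System.out.printf("Imputed value: %.2f%n", filled[0]);
    }
}
```

With the missing month filled in this way, the average of the full array equals the average of the eleven known values, so the substitution does not pull the result toward zero.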
When it is essential to handle null data, Java's Optional class provides helpful solutions. Consider the following example, where we have a list of names stored as an array. We have set one value to null for the purposes of demonstrating these methods:
String useName = "";
String[] nameList = {"Amy","Bob","Sally","Sue","Don","Rick",null,"Betsy"};
Optional<String> tempName;
for(String name : nameList){
    tempName = Optional.ofNullable(name);
    useName = tempName.orElse("DEFAULT");
    out.println("Name to use = " + useName);
}
We first created a variable called useName to hold the name we will actually print out. We also declared an Optional variable called tempName, which we use to handle values in the array that may be null. We then loop through our array and call the ofNullable method of the Optional class. This method wraps the value in an Optional, which is empty if the value is null. On the next line, we call the orElse method to either assign the value from the array to useName or, if the element is null, assign DEFAULT. Our output follows:
Name to use = Amy
Name to use = Bob
Name to use = Sally
Name to use = Sue
Name to use = Don
Name to use = Rick
Name to use = DEFAULT
Name to use = Betsy
The Optional class contains several other methods useful for handling potential null data. Although there are other ways to handle such instances, this Java 8 addition provides simpler and more elegant solutions to a common data analysis problem.
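As a brief illustration of some of those other methods, the following sketch chains map, filter, and orElse, and also shows ifPresent. The class name OptionalNames and the helper cleanName are hypothetical names introduced only for this example.

```java
import java.util.Optional;

class OptionalNames {
    // Returns the trimmed name, or the fallback when the input is null
    // or contains nothing but whitespace.
    static String cleanName(String raw, String fallback) {
        return Optional.ofNullable(raw)
                .map(String::trim)            // skipped when raw is null
                .filter(s -> !s.isEmpty())    // an empty result becomes Optional.empty()
                .orElse(fallback);
    }

    public static void main(String[] args) {
        String[] nameList = {"Amy", " Bob ", null, "   "};
        for (String name : nameList) {
            System.out.println("Name to use = " + cleanName(name, "DEFAULT"));
        }
        // ifPresent runs the action only when a value exists, so nothing
        // is printed for the null element.
        Optional.ofNullable(nameList[2]).ifPresent(System.out::println);
    }
}
```

For the four sample values, the names used are Amy, Bob, DEFAULT, and DEFAULT.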
Subsetting data
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet method of the SortedSet interface. We will begin by storing a list of numbers in a TreeSet. We then declare a SortedSet variable to hold the subset retrieved from the list. Next, we print out our original list, followed by its last (largest) element:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44};
TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums)));
SortedSet<Integer> partNumsList;
out.println("Original List: " + fullNumsList.toString() + " " + fullNumsList.last());
The subSet method takes two parameters, which specify the range of values we want to retrieve. The first parameter is inclusive, while the second is exclusive. In the example that follows, we retrieve all numbers from the lowest number in our set, 12, up to but not including 46:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46);
out.println("SubSet of List: " + partNumsList.toString() + " " + partNumsList.size());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] 123
SubSet of List: [12, 14, 34, 44] 4
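Because the upper bound of subSet is exclusive, 46 itself is left out of the result. The sketch below shows the related headSet and tailSet methods, along with the four-argument subSet overload of NavigableSet that lets each bound be marked inclusive or exclusive. The class name RangeViews and the helper sample are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

class RangeViews {
    // Builds the same numbers used in the example above.
    static NavigableSet<Integer> sample() {
        return new TreeSet<>(Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44));
    }

    public static void main(String[] args) {
        NavigableSet<Integer> nums = sample();
        // headSet: everything strictly below the bound.
        System.out.println(nums.headSet(46));                // [12, 14, 34, 44]
        // tailSet: everything at or above the bound.
        System.out.println(nums.tailSet(46));                // [46, 52, 87, 123]
        // subSet with explicit flags: both ends inclusive here.
        System.out.println(nums.subSet(12, true, 46, true)); // [12, 14, 34, 44, 46]
    }
}
```

These methods return views backed by the original set, so later changes to the full set that fall within the range are visible through the subset.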
Another option is to use the stream method in conjunction with the skip method. The stream method returns a Java 8 Stream instance which iterates over the set. We will use a list called numsList that holds the same numbers as the previous example, but this time we will specify how many elements to skip with the skip method. We will also use the collect method to create a new Set to hold the new elements. The code assumes a static import of Collectors.toCollection:
List<Integer> numsList = Arrays.asList(nums);
out.println("Original List: " + numsList.toString());
Set<Integer> fullNumsList = new TreeSet<Integer>(numsList);
Set<Integer> partNumsList = fullNumsList
    .stream()
    .skip(5)
    .collect(toCollection(TreeSet::new));
out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output, where the first five elements of the sorted set are skipped. Because the TreeSet is a SortedSet, we actually omit the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44]
SubSet of List: [52, 87, 123]
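The skip method is often paired with limit to take a window from the middle of the data rather than everything after an offset. The following is a small sketch of that combination; the class name StreamWindow and the method window are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class StreamWindow {
    // Returns up to `count` elements of the set, in iteration order,
    // starting after the first `offset` elements.
    static Set<Integer> window(Set<Integer> source, long offset, long count) {
        return source.stream()
                .skip(offset)
                .limit(count)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        Set<Integer> fullNumsList = new TreeSet<>(numsList);
        // Skip the two lowest numbers, then keep the next three.
        System.out.println(window(fullNumsList, 2, 3)); // [34, 44, 46]
    }
}
```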
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analyzed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen. Note that this test only matches completely empty lines; a line containing only spaces is still printed:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) {
    br
        .lines()
        .filter(s -> !s.equals(""))
        .forEach(s -> out.println(s));
} catch (IOException ex) {
    // Handle exceptions
}
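A variation on this example, sketched below, also drops header lines and treats whitespace-only lines as blank. To keep the sketch self-contained it reads from an in-memory string instead of a file. The class name LineFilter, the method dataLines, and the sample text are assumptions for illustration, and the sketch assumes the header lines come first in the input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class LineFilter {
    // Drops the first `headerLines` lines, then drops lines that are empty
    // or contain only whitespace.
    static List<String> dataLines(BufferedReader reader, int headerLines) {
        return reader.lines()
                .skip(headerLines)
                .filter(s -> !s.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data: one header line, then records and blanks.
        String text = "name,age\n\nZoey,8\n   \nRoxie,10\n";
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            dataLines(br, 1).forEach(System.out::println);
        }
    }
}
```

For the sample text, only the two data lines, Zoey,8 and Roxie,10, are printed.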
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
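We can now pass this comparator to the sort method of the Collections class. The numsList and wordsList lists used here are the ones initialized later in this section:
Collections.sort(numsList,compareInts);
out.println("Sorted integers using Lambda: " + numsList.toString());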
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second);
Collections.sort(wordsList,compareWords);
out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List objects and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo")
    .collect(Collectors.toList());
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44)
    .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString());
Collections.sort(wordsList);
out.println("Ascending Word List: " + wordsList.toString());
out.println("Original Integer List: " + numsList.toString());
Collections.sort(numsList);
out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort method with the reverse method of the Collections class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString());
Collections.reverse(numsList);
out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First, we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression is used:
out.println("Original Integer List: " + numsList.toString());
Comparator<Integer> basicOrder = Integer::compare;
Comparator<Integer> descendOrder = basicOrder.reversed();
Collections.sort(numsList,descendOrder);
out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
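For elements that already have a natural ordering, such as Integer, the standard library also offers Comparator.reverseOrder, which avoids building the comparator by hand. The sketch below uses it with the sort method of the List interface. The class name DescendingSort and the method descending are hypothetical names, and copying the list before sorting is a design choice of this sketch rather than something the original example does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DescendingSort {
    // Returns a new list sorted from largest to smallest; the input list
    // is left untouched.
    static List<Integer> descending(List<Integer> input) {
        List<Integer> copy = new ArrayList<>(input);
        copy.sort(Comparator.reverseOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> numsList = Arrays.asList(12, 46, 52, 34, 87, 123, 14, 44);
        System.out.println(descending(numsList)); // [123, 87, 52, 46, 44, 34, 14, 12]
    }
}
```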
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each Dogs object:
ArrayList<Dogs> dogs = new ArrayList<Dogs>();
dogs.add(new Dogs("Zoey", 8));
dogs.add(new Dogs("Roxie", 10));
dogs.add(new Dogs("Kylie", 7));
dogs.add(new Dogs("Shorty", 14));
dogs.add(new Dogs("Ginger", 7));
dogs.add(new Dogs("Penny", 7));
out.println("Name " + " Age");
for(Dogs d : dogs){
    out.println(d.getName() + " " + d.getAge());
}
Our output should resemble:
Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge));
out.println("Name " + " Age");
for(Dogs d : dogs){
    out.println(d.getName() + " " + d.getAge());
}
Our output follows:
Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName));
out.println("Name " + " Age");
for(Dogs d : dogs){
    out.println(d.getName() + " " + d.getAge());
}
And our output is:
Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
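The examples above assume a Dogs class without showing it. One possible minimal version is sketched below, together with a comparator that mixes directions: oldest dogs first, with ties broken by name. The field layout, the DogSort class, the OLDEST_FIRST constant, and the sortedNames helper are all assumptions made for this illustration, not the original class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// A hypothetical minimal Dogs class with the two properties the text describes.
class Dogs {
    private final String name;
    private final int age;

    Dogs(String name, int age) {
        this.name = name;
        this.age = age;
    }

    String getName() { return name; }
    int getAge() { return age; }
}

class DogSort {
    // Oldest dogs first; dogs of the same age are ordered by name.
    // reversed() applies only to the age comparison built before it.
    static final Comparator<Dogs> OLDEST_FIRST =
            Comparator.comparingInt(Dogs::getAge).reversed()
                    .thenComparing(Dogs::getName);

    static List<String> sortedNames(List<Dogs> dogs) {
        return dogs.stream()
                .sorted(OLDEST_FIRST)
                .map(Dogs::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Dogs> dogs = new ArrayList<>();
        dogs.add(new Dogs("Zoey", 8));
        dogs.add(new Dogs("Roxie", 10));
        dogs.add(new Dogs("Kylie", 7));
        dogs.add(new Dogs("Shorty", 14));
        dogs.add(new Dogs("Ginger", 7));
        dogs.add(new Dogs("Penny", 7));
        System.out.println(sortedNames(dogs));
        // [Shorty, Roxie, Zoey, Ginger, Kylie, Penny]
    }
}
```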
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234");
validateInt("Ishmael");
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
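As a sketch of how the same technique might be adapted for floating point data, the following version uses Double.parseDouble and returns a boolean instead of printing. The class name DoubleCheck and the method isValidDouble are hypothetical. Rejecting NaN and infinite values is an extra assumption of this sketch, since parseDouble accepts the strings "NaN" and "Infinity".

```java
class DoubleCheck {
    // Returns true when the text parses as a finite double.
    static boolean isValidDouble(String toValidate) {
        if (toValidate == null) {
            // Unlike Integer.parseInt, Double.parseDouble(null) throws
            // NullPointerException, so null is handled explicitly.
            return false;
        }
        try {
            double value = Double.parseDouble(toValidate);
            return !Double.isNaN(value) && !Double.isInfinite(value);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDouble("12.34"));   // true
        System.out.println(isValidDouble("1e3"));     // true
        System.out.println(isValidDouble("Ishmael")); // false
        System.out.println(isValidDouble("NaN"));     // false
    }
}
```

Returning a boolean rather than printing makes the check easier to reuse inside a stream filter or a larger validation routine.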
The Apache Commons Validator library contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text){
    IntegerValidator intValidator = IntegerValidator.getInstance();
    if(intValidator.isValid(text)){
        return text + " is a valid integer";
    }else{
        return text + " is not a valid integer";
    }
}
Because this version returns a String rather than printing, we print the result of each method call to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){
    try {
        SimpleDateFormat format = new SimpleDateFormat(dateFormat);
        Date test = format.parse(theDate);
        if(format.format(test).equals(theDate)){
            return theDate + " is a valid date";
        }else{
            return theDate + " is not a valid date";
        }
    } catch (ParseException e) {
        return theDate + " is not a valid date";
    }
}
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy";
out.println(validateDate("12/12/1982",dateFormat));
out.println(validateDate("12/12/82",dateFormat));
out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
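The java.time API introduced in Java 8 offers another way to express the same strictness without formatting the date back to a string. The sketch below is one possible approach, not the method used above; the class name StrictDateCheck and the method isValidDate are hypothetical, and the fixed pattern is an assumption of this example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

class StrictDateCheck {
    // "uuuu" (year) is used instead of "yyyy" (year-of-era) because strict
    // resolution requires an era whenever year-of-era is parsed.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/uuuu")
                    .withResolverStyle(ResolverStyle.STRICT);

    // Returns true when the text is a real calendar date in MM/dd/uuuu form.
    static boolean isValidDate(String theDate) {
        try {
            LocalDate.parse(theDate, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDate("12/12/1982")); // true
        System.out.println(isValidDate("12/12/82"));   // false: year needs four digits
        System.out.println(isValidDate("02/30/1982")); // false: no February 30th
        System.out.println(isValidDate("Ishmael"));    // false
    }
}
```

With strict resolution, impossible dates such as February 30th are rejected rather than adjusted to a nearby valid date.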
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com"));
out.println(validateEmail("My.Email.123!@mail.net"));
out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address
The JavaMail API, which is distributed separately from the core Java library, can also validate e-mail addresses. In this example, we use its InternetAddress class to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){
    try{
        InternetAddress testEmail = new InternetAddress(email);
        testEmail.validate();
        return email + " is a valid email address";
    }catch(AddressException e){
        return email + " is not a valid email address";
    }
}
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email){
    email = email.trim();
    EmailValidator eValidator = EmailValidator.getInstance();
    if(eValidator.isValid(email)){
        return email + " is a valid email address.";
    }else{
        return email + " is not a valid email address.";
    }
}
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){
    String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$";
    Pattern pattern = Pattern.compile(zipRegex);
    Matcher matcher = pattern.matcher(zip);
    if(matcher.matches()){
        out.println(zip + " is a valid zip code");
    }else{
        out.println(zip + " is not a valid zip code");
    }
}
Because validateZip prints its own result, we call it directly to test our data:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
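When cleaning data, it is often useful to extract a normalized value at the same time as validating it. The sketch below adds capturing groups to the same ZIP pattern so that the five-digit base code can be pulled out of either form. The class name ZipParts and the method baseZip are hypothetical names, and trimming the input first is an assumption of this sketch.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipParts {
    // Group 1 is the five-digit code; group 2 is the optional four-digit extension.
    private static final Pattern ZIP =
            Pattern.compile("^([0-9]{5})(?:-([0-9]{4}))?$");

    // Returns the five-digit base code when the whole string is a valid ZIP
    // or ZIP+4 code, and an empty Optional otherwise. The input must not be null.
    static Optional<String> baseZip(String zip) {
        Matcher matcher = ZIP.matcher(zip.trim());
        return matcher.matches() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(baseZip("12345"));      // Optional[12345]
        System.out.println(baseZip("12345-6789")); // Optional[12345]
        System.out.println(baseZip("123"));        // Optional.empty
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which matters when validating many records.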
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any letter from any language. The Unicode property \\p{L} provides this flexibility. We also include \\s',.- in the character class to allow spaces, apostrophes, commas, periods, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    String nameRegex = "^[\\p{L}\\s',.-]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr.");
validateName("Bobby Smith the 4th");
validateName("Albrecht Müller");
validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
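The string cleaning mentioned in this section can be sketched as a normalization step that runs before the match. In the version below, whitespace is trimmed and collapsed first, so the pattern only needs to allow a plain space. The class name NameCleaner, the methods normalize and isValidName, and the exact character class are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class NameCleaner {
    // Letters from any language plus space, apostrophe, comma, period, hyphen.
    private static final Pattern NAME = Pattern.compile("^[\\p{L} ',.-]+$");

    // Trims the ends and collapses runs of whitespace to a single space.
    static String normalize(String name) {
        return name.trim().replaceAll("\\s+", " ");
    }

    // Validates the normalized form of the name. The input must not be null.
    static boolean isValidName(String name) {
        return NAME.matcher(normalize(name)).matches();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Bobby   Smith, Jr. "));   // Bobby Smith, Jr.
        System.out.println(isValidName("  Bobby   Smith, Jr. ")); // true
        System.out.println(isValidName("Bobby Smith the 4th"));   // false
        System.out.println(isValidName("Albrecht M\u00fcller"));  // true
    }
}
```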
Subsetting data
It is not always practical or desirable to work with an entire set of data. In these cases, we may want to retrieve a subset of data to either work with or remove entirely from the dataset. There are a few ways of doing this supported by the standard Java libraries. First, we will use the subSet
method of the SortedSet
interface. We will begin by storing a list of numbers in a TreeSet
. We then create a new TreeSet
object to hold the subset retrieved from the list. Next, we print out our original list:
Integer[] nums = {12, 46, 52, 34, 87, 123, 14, 44}; TreeSet<Integer> fullNumsList = new TreeSet<Integer>(new ArrayList<>(Arrays.asList(nums))); SortedSet<Integer> partNumsList; out.println("Original List: " + fullNumsList.toString() + " " + fullNumsList.last());
The subSet
method takes two parameters, which specify the range of integers within the data we want to retrieve. The first parameter is included in the results while the second is exclusive. In our example that follows, we want to retrieve a subset of all numbers between the first number in our array 12
and 46
:
partNumsList = fullNumsList.subSet(fullNumsList.first(), 46); out.println("SubSet of List: " + partNumsList.toString() + " " + partNumsList.size());
Our output follows:
Original List: [12, 14, 34, 44, 46, 52, 87, 123] SubSet of List: [12, 14, 34, 44]
Another option is to use the stream
method in conjunction with the skip
method. The stream
method returns a Java 8 Stream instance which iterates over the set. We will use the same numsList
as in the previous example, but this time we will specify how many elements to skip with the skip
method. We will also use the collect
method to create a new Set
to hold the new elements:
out.println("Original List: " + numsList.toString()); Set<Integer> fullNumsList = new TreeSet<Integer>(numsList); Set<Integer> partNumsList = fullNumsList .stream() .skip(5) .collect(toCollection(TreeSet::new)); out.println("SubSet of List: " + partNumsList.toString());
When we print out the new subset, we get the following output where the first five elements of the sorted set are skipped. Because it is a SortedSet
, we will actually be omitting the five lowest numbers:
Original List: [12, 46, 52, 34, 87, 123, 14, 44] SubSet of List: [52, 87, 123]
At times, data will begin with blank lines or header lines that we wish to remove from our dataset to be analysed. In our final example, we will read data from a file and remove all blank lines. We use a BufferedReader
to read our data and employ a lambda expression to test for a blank line. If the line is not blank, we print the line to the screen:
try (BufferedReader br = new BufferedReader(new FileReader("C:\\text.txt"))) { br .lines() .filter(s -> !s.equals("")) .forEach(s -> out.println(s)); } catch (IOException ex) { // Handle exceptions }
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator
interface in conjunction with a lambda expression.
We start by declaring our Comparator
variable compareInts
. The first set of parenthesis after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare
method, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
We can now call the sort
method as we did previously:
Collections.sort(numsList,compareInts); out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList
. Notice the use of the compareTo
method rather than compare
:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second); Collections.sort(wordsList,compareWords); out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
In our next example, we are going to use the Collections
class to perform basic sorting on String
and integer data. For this example, wordList
and numsList
are both ArrayList
and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo") .collect(Collectors.toList()); List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44) .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort
method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString()); Collections.sort(wordsList); out.println("Ascending Word List: " + wordsList.toString()); out.println("Original Integer List: " + numsList.toString()); Collections.sort(numsList); out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo] Ascending Word List: [boat, cat, dog, house, road, zoo] Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
Next, we will replace the sort
method with the reverse
method of the Collections
class in our integer data example. This method simply takes the elements and stores them in reverse order:
out.println("Original Integer List: " + numsList.toString()); Collections.reverse(numsList); out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList
:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator
interface. We will continue to use our numsList
and assume that no sorting has occurred yet. First we create two objects that implement the Comparator
interface. The sort
method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare
is a Java 8 method reference. This is can be used where a lambda expression is used:
out.println("Original Integer List: " + numsList.toString()); Comparator<Integer> basicOrder = Integer::compare; Comparator<Integer> descendOrder = basicOrder.reversed(); Collections.sort(numsList,descendOrder); out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44] Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dog
class that contains two properties, name
and age
, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList
and then printing the names and ages of each Dog
:
ArrayList<Dogs> dogs = new ArrayList<Dogs>(); dogs.add(new Dogs("Zoey", 8)); dogs.add(new Dogs("Roxie", 10)); dogs.add(new Dogs("Kylie", 7)); dogs.add(new Dogs("Shorty", 14)); dogs.add(new Dogs("Ginger", 7)); dogs.add(new Dogs("Penny", 7)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output should resemble:
Name Age Zoey 8 Roxie 10 Kylie 7 Shorty 14 Ginger 7 Penny 7
Next, we are going to use method chaining and the double colon operator to reference methods from the Dog
class. We first call comparing
followed by thenComparing
to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dog
objects sorted first by Name
and then by Age
:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
Our output follows:
Name Age Ginger 7 Kylie 7 Penny 7 Roxie 10 Shorty 14 Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName)); out.println("Name " + " Age"); for(Dogs d : dogs){ out.println(d.getName() + " " + d.getAge()); }
And our output is:
Name Age Ginger 7 Kylie 7 Penny 7 Zoey 8 Roxie 10 Shorty 14
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateIn
t method. This technique is easily modified for the other major data types supported in the standard Java library, including Float
and Double
.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){ try{ int validInt = Integer.parseInt(toValidate); out.println(validInt + " is a valid integer"); }catch(NumberFormatException e){ out.println(toValidate + " is not a valid integer"); }
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The Apache Commons contain an IntegerValidator
class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator
methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator
class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a ranger of numbers, and convert Number
objects to Integer
objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy";
out.println(validateDate("12/12/1982",dateFormat));
out.println(validateDate("12/12/82",dateFormat));
out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
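One way to loosen the restriction without giving up strictness is to accept a small list of formats. The sketch below swaps in the java.time API (available since Java 8) in place of SimpleDateFormat; the class name FlexibleDateCheck and the method isValidDate are hypothetical. It uses the pattern letter u for the year because strict resolution in java.time expects an era whenever the year-of-era letter y is used.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

// Hypothetical helper: a date is valid if it parses strictly under any accepted pattern.
class FlexibleDateCheck {

    static boolean isValidDate(String text, String... patterns) {
        for (String pattern : patterns) {
            // STRICT rejects impossible dates such as February 30
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern)
                    .withResolverStyle(ResolverStyle.STRICT);
            try {
                LocalDate.parse(text, formatter);
                return true;
            } catch (DateTimeParseException e) {
                // This pattern did not fit; try the next one
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] accepted = {"MM/dd/uuuu", "uuuu-MM-dd"};
        System.out.println(isValidDate("12/12/1982", accepted)); // true
        System.out.println(isValidDate("1982-12-12", accepted)); // true
        System.out.println(isValidDate("02/30/1982", accepted)); // false
    }
}
```

Each additional pattern widens what is accepted, so the list should stay as short as the data allows.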
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    // Keep a reference to the compiled pattern so that it can create the matcher
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com"));
out.println(validateEmail("My.Email.123!@mail.net"));
out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address
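When many addresses are validated, it may be worth compiling the expression once rather than on every call. The following is a minimal sketch of that idea, based on the expression used in validateEmail; the class name EmailPatternCheck and the method isValidEmail are hypothetical, and returning a boolean instead of a message is a design choice made here for easier reuse.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the pattern is compiled once and reused for every call.
class EmailPatternCheck {

    // Pattern instances are immutable, so a shared constant is safe to reuse
    private static final Pattern EMAIL = Pattern.compile(
            "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-]+@"
            + "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])"
            + "|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$");

    // Returns true only when the whole string matches the expression
    static boolean isValidEmail(String email) {
        return email != null && EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidEmail("MyEmail@some.mail.com")); // true
        System.out.println(isValidEmail("user@[192.168.0.1]"));    // true
        System.out.println(isValidEmail("myEmail@mail"));          // false
    }
}
```

The second call exercises the bracketed IP address branch of the expression, which is easy to overlook when reading the pattern.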
The JavaMail API, a separate library rather than part of the core Java SE platform, can validate e-mail addresses as well. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){
    try{
        InternetAddress testEmail = new InternetAddress(email);
        testEmail.validate();
        return email + " is a valid email address";
    }catch(AddressException e){
        return email + " is not a valid email address";
    }
}
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons Validator EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email){
    email = email.trim();
    EmailValidator eValidator = EmailValidator.getInstance();
    if(eValidator.isValid(email)){
        return email + " is a valid email address.";
    }else{
        return email + " is not a valid email address.";
    }
}
Validating ZIP codes
Postal codes are generally formatted according to country or local requirements. For this reason, regular expressions are useful because they can be adapted to whatever postal code format is required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){
    String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$";
    Pattern pattern = Pattern.compile(zipRegex);
    Matcher matcher = pattern.matcher(zip);
    if(matcher.matches()){
        out.println(zip + " is a valid zip code");
    }else{
        out.println(zip + " is not a valid zip code");
    }
}
We make the following method calls to test our data. Because validateZip prints its result and returns nothing, we call it directly rather than passing it to out.println:
validateZip("12345");
validateZip("12345-6789");
validateZip("123");
The output follows:
12345 is a valid zip code
12345-6789 is a valid zip code
123 is not a valid zip code
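To illustrate how the same technique extends to other countries, the sketch below keeps one compiled pattern per country code. Everything here is an assumption made for illustration: the class name PostalCodeCheck, the method isValidPostalCode, and in particular the Canadian pattern, which only checks the letter-digit layout and ignores the rules about which letters may actually appear.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper: one regular expression per country code.
class PostalCodeCheck {

    private static final Map<String, Pattern> PATTERNS = new HashMap<>();
    static {
        // United States: five digits, optionally followed by a hyphen and four digits
        PATTERNS.put("US", Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$"));
        // Canada (simplified layout check): A1A 1A1, with an optional space or hyphen
        PATTERNS.put("CA", Pattern.compile("^[A-Za-z][0-9][A-Za-z][ -]?[0-9][A-Za-z][0-9]$"));
    }

    // Returns false for unknown country codes as well as for non-matching codes
    static boolean isValidPostalCode(String countryCode, String code) {
        Pattern pattern = PATTERNS.get(countryCode);
        return pattern != null && pattern.matcher(code).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPostalCode("US", "12345-6789")); // true
        System.out.println(isValidPostalCode("CA", "K1A 0B1"));    // true
        System.out.println(isValidPostalCode("US", "123"));        // false
    }
}
```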
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match letters from any language. The Unicode property \\p{L} provides this flexibility. We also use \\s-',. to allow spaces, hyphens, apostrophes, commas, and periods in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){
    // Letters from any language, plus whitespace, hyphen, apostrophe, comma, and period
    String nameRegex = "^[\\p{L}\\s-',.]+$";
    Pattern pattern = Pattern.compile(nameRegex);
    Matcher matcher = pattern.matcher(name);
    if(matcher.matches()){
        out.println(name + " is a valid name");
    }else{
        out.println(name + " is not a valid name");
    }
}
We make the following method calls to test our data:
validateName("Bobby Smith, Jr.");
validateName("Bobby Smith the 4th");
validateName("Albrecht Müller");
validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name
Bobby Smith the 4th is not a valid name
Albrecht Müller is a valid name
François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr. are acceptable, but the 4 in 4th is not. Additionally, the special characters in François and Müller are considered valid.
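The idea of cleaning a name before matching it can be sketched as follows. This is one possible arrangement under stated assumptions, not the only one: the class name CleanedNameCheck and its methods are hypothetical, and the cleaning step simply trims the text, removes periods and commas, and collapses repeated whitespace so that the expression no longer needs to allow for them.

```java
import java.util.regex.Pattern;

// Hypothetical helper: clean the text first, then match a simpler expression.
class CleanedNameCheck {

    // After cleaning, only letters, single spaces, apostrophes, and hyphens should remain
    private static final Pattern NAME = Pattern.compile("^[\\p{L} '-]+$");

    static String cleanName(String name) {
        return name.trim()
                .replaceAll("[.,]", "")    // drop periods and commas
                .replaceAll("\\s+", " ");  // collapse runs of whitespace into one space
    }

    static boolean isValidName(String name) {
        return NAME.matcher(cleanName(name)).matches();
    }

    public static void main(String[] args) {
        System.out.println(cleanName("Bobby Smith, Jr."));      // Bobby Smith Jr
        System.out.println(isValidName("Bobby Smith, Jr."));    // true
        System.out.println(isValidName("Bobby Smith the 4th")); // false
    }
}
```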
Sorting text
Sometimes it is necessary to sort data during the cleaning process. The standard Java library provides several resources for accomplishing different types of sorts, with improvements added with the release of Java 8. In our first example, we will use the Comparator interface in conjunction with a lambda expression.
We start by declaring our Comparator variable compareInts. The first set of parentheses after the equals sign contains the parameters to be passed to our method. Within the lambda expression, we call the compare method of the Integer class, which determines which integer is larger:
Comparator<Integer> compareInts = (Integer first, Integer second) -> Integer.compare(first, second);
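We can now pass compareInts to the sort method of the Collections class. The numsList and wordsList lists used in these examples are initialized later in this section: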
Collections.sort(numsList,compareInts);
out.println("Sorted integers using Lambda: " + numsList.toString());
Our output follows:
Sorted integers using Lambda: [12, 14, 34, 44, 46, 52, 87, 123]
We then mimic the process with our wordsList. Notice the use of the compareTo method rather than compare:
Comparator<String> compareWords = (String first, String second) -> first.compareTo(second);
Collections.sort(wordsList,compareWords);
out.println("Sorted words using Lambda: " + wordsList.toString());
When this code is executed, we should see the following output:
Sorted words using Lambda: [boat, cat, dog, house, road, zoo]
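One detail worth keeping in mind with compareTo is that it is case-sensitive, so capitalized words sort ahead of lowercase ones. The sketch below contrasts that with the String.CASE_INSENSITIVE_ORDER comparator from the standard library; the class name CaseInsensitiveSort, its method names, and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper contrasting case-sensitive and case-insensitive ordering.
class CaseInsensitiveSort {

    // Returns a new list sorted with compareTo (case-sensitive)
    static List<String> naturalOrder(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort((first, second) -> first.compareTo(second));
        return copy;
    }

    // Returns a new list sorted while ignoring case
    static List<String> ignoringCase(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(String.CASE_INSENSITIVE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("dog", "Cat", "boat", "Zoo");
        System.out.println(naturalOrder(words)); // [Cat, Zoo, boat, dog]
        System.out.println(ignoringCase(words)); // [boat, Cat, dog, Zoo]
    }
}
```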
In our next example, we are going to use the Collections class to perform basic sorting on String and integer data. For this example, wordsList and numsList are both List objects and are initialized as follows:
List<String> wordsList = Stream.of("cat", "dog", "house", "boat", "road", "zoo")
    .collect(Collectors.toList());
List<Integer> numsList = Stream.of(12, 46, 52, 34, 87, 123, 14, 44)
    .collect(Collectors.toList());
First, we will print our original version of each list followed by a call to the sort method. We then display our data, sorted in ascending fashion:
out.println("Original Word List: " + wordsList.toString());
Collections.sort(wordsList);
out.println("Ascending Word List: " + wordsList.toString());
out.println("Original Integer List: " + numsList.toString());
Collections.sort(numsList);
out.println("Ascending Integer List: " + numsList.toString());
The output follows:
Original Word List: [cat, dog, house, boat, road, zoo]
Ascending Word List: [boat, cat, dog, house, road, zoo]
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Ascending Integer List: [12, 14, 34, 44, 46, 52, 87, 123]
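Collections.sort rearranges the list it is given. When the original order must be preserved, for instance to compare the data before and after cleaning, a sorted copy can be produced with a stream instead. The following is a minimal sketch; the class name SortedCopy and the method sortedCopy are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: sorting without modifying the source list.
class SortedCopy {

    // Returns a new list in ascending order; the list passed in is left untouched
    static <T extends Comparable<? super T>> List<T> sortedCopy(List<T> values) {
        return values.stream()
                .sorted()
                .collect(Collectors.toList());
    }
}
```

Calling sortedCopy(numsList) would return the ascending list shown above while numsList itself keeps its original order.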
Next, we will replace the sort method with the reverse method of the Collections class in our integer data example, starting again from the original, unsorted list. This method simply takes the elements and stores them in reverse order, without sorting them:
out.println("Original Integer List: " + numsList.toString());
Collections.reverse(numsList);
out.println("Reversed Integer List: " + numsList.toString());
The output displays our new numsList:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Reversed Integer List: [44, 14, 123, 87, 34, 52, 46, 12]
In our next example, we handle the sort using the Comparator interface. We will continue to use our numsList and assume that no sorting has occurred yet. First, we create two objects that implement the Comparator interface. The sort method will use these objects to determine the desired order when comparing two elements. The expression Integer::compare is a Java 8 method reference. It can be used wherever a lambda expression can be used:
out.println("Original Integer List: " + numsList.toString());
Comparator<Integer> basicOrder = Integer::compare;
Comparator<Integer> descendOrder = basicOrder.reversed();
Collections.sort(numsList,descendOrder);
out.println("Descending Integer List: " + numsList.toString());
After we execute this code, we will see the following output:
Original Integer List: [12, 46, 52, 34, 87, 123, 14, 44]
Descending Integer List: [123, 87, 52, 46, 44, 34, 14, 12]
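For elements that already have a natural order, such as Integer, the standard library also offers Comparator.reverseOrder(), which avoids declaring the intermediate comparator. A short sketch follows; the class name DescendingSort and the method descending are hypothetical, and the method sorts a copy so the argument is not changed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: descending sort using the natural order in reverse.
class DescendingSort {

    // Returns a new list sorted from largest to smallest
    static List<Integer> descending(List<Integer> values) {
        List<Integer> copy = new ArrayList<>(values);
        copy.sort(Comparator.reverseOrder());
        return copy;
    }
}
```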
In our last example, we will attempt a more complex sort involving two comparisons. Let's assume there is a Dogs class that contains two properties, name and age, along with the necessary accessor methods. We will begin by adding elements to a new ArrayList and then printing the name and age of each dog:
ArrayList<Dogs> dogs = new ArrayList<Dogs>();
dogs.add(new Dogs("Zoey", 8));
dogs.add(new Dogs("Roxie", 10));
dogs.add(new Dogs("Kylie", 7));
dogs.add(new Dogs("Shorty", 14));
dogs.add(new Dogs("Ginger", 7));
dogs.add(new Dogs("Penny", 7));
out.println("Name " + " Age");
for(Dogs d : dogs){
    out.println(d.getName() + " " + d.getAge());
}
Our output should resemble:
Name Age
Zoey 8
Roxie 10
Kylie 7
Shorty 14
Ginger 7
Penny 7
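The class used in this example is assumed rather than shown. A minimal sketch of what it might look like follows; the field types and the constructor signature are assumptions inferred from the calls above, not the original definition.

```java
// A minimal sketch of the class assumed by this example; details are inferred, not original.
class Dogs {
    private final String name;
    private final int age;

    public Dogs(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Accessor methods referenced by the sorting code
    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }
}
```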
Next, we are going to use method chaining and the double colon operator to reference methods from the Dogs class. We first call comparing followed by thenComparing to specify the order in which comparisons should occur. When we execute the code, we expect to see the Dogs objects sorted first by name and then by age:
dogs.sort(Comparator.comparing(Dogs::getName).thenComparing(Dogs::getAge));
out.println("Name " + " Age");
for(Dogs d : dogs){
    out.println(d.getName() + " " + d.getAge());
}
Our output follows:
Name Age
Ginger 7
Kylie 7
Penny 7
Roxie 10
Shorty 14
Zoey 8
Now we will switch the order of comparison. Notice how the age of the dog takes priority over the name in this version:
dogs.sort(Comparator.comparing(Dogs::getAge).thenComparing(Dogs::getName));
out.println("Name " + " Age");
for(Dogs d : dogs){
    out.println(d.getName() + " " + d.getAge());
}
And our output is:
Name Age
Ginger 7
Kylie 7
Penny 7
Zoey 8
Roxie 10
Shorty 14
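The same chaining style can mix directions. As a sketch, the code below orders the dogs from oldest to youngest and breaks ties by name, using comparingInt and reversed from the Comparator interface. The class name OldestFirstSort and the method namesOldestFirst are hypothetical, and the nested Dogs class is a stand-in for the class assumed in the text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: age descending, then name ascending.
class OldestFirstSort {

    // Stand-in for the Dogs class assumed in the text
    static class Dogs {
        private final String name;
        private final int age;

        public Dogs(String name, int age) {
            this.name = name;
            this.age = age;
        }

        public String getName() { return name; }
        public int getAge() { return age; }
    }

    // Returns the names ordered by age (oldest first), with ties broken by name
    static List<String> namesOldestFirst(List<Dogs> dogs) {
        List<Dogs> copy = new ArrayList<>(dogs);
        // reversed() applies only to the age comparison built before it
        copy.sort(Comparator.comparingInt(Dogs::getAge).reversed()
                .thenComparing(Dogs::getName));
        List<String> names = new ArrayList<>();
        for (Dogs d : copy) {
            names.add(d.getName());
        }
        return names;
    }
}
```

With the six dogs used above, this ordering would be Shorty, Roxie, Zoey, Ginger, Kylie, Penny. Note that calling reversed() at the very end of the chain instead would reverse the name comparison as well.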
Data validation
Data validation is an important part of data science. Before we can analyze and manipulate data, we need to verify that the data is of the type expected. We have organized our code into simple methods designed to accomplish very basic validation tasks. The code within these methods can be adapted into existing applications.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateIn
t method. This technique is easily modified for the other major data types supported in the standard Java library, including Float
and Double
.
We need to use a try-catch block here to catch a NumberFormatException
. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt
method of the Integer
class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){ try{ int validInt = Integer.parseInt(toValidate); out.println(validInt + " is a valid integer"); }catch(NumberFormatException e){ out.println(toValidate + " is not a valid integer"); }
We will use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The Apache Commons contain an IntegerValidator
class with additional useful functionalities. In this first example, we simply duplicate the process from before, but use IntegerValidator
methods to accomplish our goal:
public static String validateInt(String text){ IntegerValidator intValidator = IntegerValidator.getInstance(); if(intValidator.isValid(text)){ return text + " is a valid integer"; }else{ return text + " is not a valid integer"; } }
We again use the following method calls to test our method:
validateInt("1234"); validateInt("Ishmael");
The output follows:
1234 is a valid integer Ishmael is not a valid integer
The IntegerValidator
class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a ranger of numbers, and convert Number
objects to Integer
objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date for example, it is insufficient to simply verify that it is made up of integers. We may need to include hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate
. The method takes two String
parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat
class using the format specified in the parameter. Then we call the parse
method to convert our String
date to a Date
object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method returns. If, however, the String
can be parsed to a date, we simply compare the format of the test date with our acceptable format to determine whether the date is valid:
public static String validateDate(String theDate, String dateFormat){ try { SimpleDateFormat format = new SimpleDateFormat(dateFormat); Date test = format.parse(theDate); if(format.format(test).equals(theDate)){ return theDate.toString() + " is a valid date"; }else{ return theDate.toString() + " is not a valid date"; } } catch (ParseException e) { return theDate.toString() + " is not a valid date"; } }
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy"; out.println(validateDate("12/12/1982",dateFormat)); out.println(validateDate("12/12/82",dateFormat)); out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date 12/12/82 is not a valid date Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @
symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern
and Matcher
classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) { String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-" + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\." + "[0-9]{1,3}\\])|(([a-zAZ\\-0-9]+\\.)+[a-zA-Z]{2,}))$"; Pattern.compile(emailRegex); Matcher matcher = pattern.matcher(email); if(matcher.matches()){ return email + " is a valid email address"; }else{ return email + " is not a valid email address"; } }
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com")); out.println(validateEmail("My.Email.123!@mail.net")); out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address My.Email.123!@mail.net is a valid email address myEmail is not a valid email address
There is a standard Java library for validating e-mail addresses as well. In this example, we use the InternetAddress
class to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){ try{ InternetAddress testEmail = new InternetAddress(email); testEmail.validate(); return email + " is a valid email address"; }catch(AddressException e){ return email + " is not a valid email address"; } }
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate
method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
One last option we will look at is the Apache Commons EmailValidator
class. This class's isValid
method examines an e-mail address and determines whether it is valid or not. Our validateEmail
method shown previously is modified as follows to use EmailValidator
:
public static String validateEmailApache(String email){ email = email.trim(); EmailValidator eValidator = EmailValidator.getInstance(); if(eValidator.isValid(email)){ return email + " is a valid email address."; }else{ return email + " is not a valid email address."; } }
Validating ZIP codes
Postal codes are generally formatted specific to their country or local requirements. For this reason, regular expressions are useful because they can accommodate any postal code required. The example that follows demonstrates how to validate a standard United States postal code, with or without the hyphen and last four digits:
public static void validateZip(String zip){ String zipRegex = "^[0-9]{5}(?:-[0-9]{4})?$"; Pattern pattern = Pattern.compile(zipRegex); Matcher matcher = pattern.matcher(zip); if(matcher.matches()){ out.println(zip + " is a valid zip code"); }else{ out.println(zip + " is not a valid zip code"); } }
We make the following method calls to test our data:
out.println(validateZip("12345")); out.println(validateZip("12345-6789")); out.println(validateZip("123"));
The output follows:
12345 is a valid zip code 12345-6789 is a valid zip code 123 is not a valid zip code
Validating names
Names can be especially tricky to validate because there are so many variations. There are no industry standards or technical limitations, other than what characters are available on the keyboard. For this example, we have chosen to use Unicode in our regular expression because it allows us to match any character from any language. The Unicode property \\p{L}
provides this flexibility. We also use \\s-'
, to allow spaces, apostrophes, commas, and hyphens in our name fields. It is possible to perform string cleaning, as discussed earlier in this chapter, before attempting to match names. This will simplify the regular expression required:
public static void validateName(String name){ String nameRegex = "^[\\p{L}\\s-',]+$"; Pattern pattern = Pattern.compile(nameRegex); Matcher matcher = pattern.matcher(name); if(matcher.matches()){ out.println(name + " is a valid name"); }else{ out.println(name + " is not a valid name"); } }
We make the following method calls to test our data:
validateName("Bobby Smith, Jr."); validateName("Bobby Smith the 4th"); validateName("Albrecht Müller"); validateName("François Moreau");
The output follows:
Bobby Smith, Jr. is a valid name Bobby Smith the 4th is not a valid name Albrecht Müller is a valid name François Moreau is a valid name
Notice that the comma and period in Bobby Smith, Jr.
are acceptable, but the 4
in 4th
is not. Additionally, the special characters in François
and Müller
are considered valid.
Validating data types
Sometimes we simply need to validate whether a piece of data is of a specific type, such as integer or floating point data. We will demonstrate in the next example how to validate integer data using the validateInt method. This technique is easily modified for the other major data types supported in the standard Java library, including Float and Double.
We need to use a try-catch block here to catch a NumberFormatException. If an exception is thrown, we know our data is not a valid integer. We first pass our text to be tested to the parseInt method of the Integer class. If the text can be parsed as an integer, we simply print out the integer. If an exception is thrown, we display information to that effect:
public static void validateInt(String toValidate){
    try{
        // parseInt throws NumberFormatException for non-integer text
        int validInt = Integer.parseInt(toValidate);
        out.println(validInt + " is a valid integer");
    }catch(NumberFormatException e){
        out.println(toValidate + " is not a valid integer");
    }
}
We will use the following method calls to test our method:
validateInt("1234");
validateInt("Ishmael");
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
Apache Commons Validator contains an IntegerValidator class with additional useful functionality. In this first example, we simply duplicate the process from before, but use IntegerValidator methods to accomplish our goal:
public static String validateInt(String text){
    IntegerValidator intValidator = IntegerValidator.getInstance();
    if(intValidator.isValid(text)){
        return text + " is a valid integer";
    }else{
        return text + " is not a valid integer";
    }
}
We again use the following method calls to test our method:
out.println(validateInt("1234"));
out.println(validateInt("Ishmael"));
The output follows:
1234 is a valid integer
Ishmael is not a valid integer
The IntegerValidator class also provides methods to determine whether an integer is greater than or less than a specific value, compare the number to a range of numbers, and convert Number objects to Integer objects. Apache Commons has a number of other validator classes. We will examine a few more in the rest of this section.
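The same try-catch technique carries over to other types, and a simple range check can be layered on top without any external library. The sketch below uses only the standard library; the class NumberChecks and its method names are hypothetical and are not part of Apache Commons:

```java
// Hypothetical helper class; the names here are illustrative only.
final class NumberChecks {
    private NumberChecks() { }

    // Same technique as validateInt, applied to double values.
    static boolean isDouble(String text) {
        if (text == null) {
            return false; // parseDouble throws NullPointerException for null
        }
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // True when text parses as an int and lies within [min, max] inclusive.
    static boolean isIntInRange(String text, int min, int max) {
        try {
            int value = Integer.parseInt(text);
            return value >= min && value <= max;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```

Note that Double.parseDouble also accepts text such as "NaN" and "1e3", so a stricter check may be needed when only plain decimal notation is acceptable.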
Validating dates
Many times our data validation is more complex than simply determining whether a piece of data is the correct type. When we want to verify that the data is a date, for example, it is insufficient to simply verify that it is made up of integers. We may need to allow hyphens and slashes, or ensure that the year is in two-digit or four-digit format.
To do this, we have created another simple method called validateDate. The method takes two String parameters, one to hold the date to validate and the other to hold the acceptable date format. We create an instance of the SimpleDateFormat class using the format specified in the parameter. Then we call the parse method to convert our String date to a Date object. Just as in our previous integer example, if the data cannot be parsed as a date, an exception is thrown and the method reports the date as invalid. If, however, the String can be parsed to a date, we format the parsed Date back into a String and compare it with the original text. This second check is needed because SimpleDateFormat parses leniently by default, so text that does not exactly follow the format can still be parsed:
public static String validateDate(String theDate, String dateFormat){
    try {
        SimpleDateFormat format = new SimpleDateFormat(dateFormat);
        Date test = format.parse(theDate);
        // Round trip: the reformatted date must equal the original text
        if(format.format(test).equals(theDate)){
            return theDate + " is a valid date";
        }else{
            return theDate + " is not a valid date";
        }
    } catch (ParseException e) {
        return theDate + " is not a valid date";
    }
}
We make the following method calls to test our method:
String dateFormat = "MM/dd/yyyy";
out.println(validateDate("12/12/1982",dateFormat));
out.println(validateDate("12/12/82",dateFormat));
out.println(validateDate("Ishmael",dateFormat));
The output follows:
12/12/1982 is a valid date
12/12/82 is not a valid date
Ishmael is not a valid date
This example highlights why it is important to consider the restrictions you place on data. Our second method call did contain a legitimate date, but it was not in the format we specified. This is good if we are looking for very specifically formatted data. But we also run the risk of leaving out useful data if we are too restrictive in our validation.
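One way to reduce the risk of discarding useful data is to accept a small, explicit list of formats. The following sketch swaps in the java.time API (Java 8 and later) in place of SimpleDateFormat. The class name FlexibleDates and the two chosen formats are assumptions made for illustration. The pattern letter u is used for the year because strict resolution does not accept a year-of-era (y) without an era:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper class; the names and formats here are illustrative only.
final class FlexibleDates {
    // Assumed set of acceptable formats; STRICT rejects dates such as 02/30.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
        DateTimeFormatter.ofPattern("MM/dd/uuuu")
            .withResolverStyle(ResolverStyle.STRICT),
        DateTimeFormatter.ofPattern("uuuu-MM-dd")
            .withResolverStyle(ResolverStyle.STRICT));

    private FlexibleDates() { }

    // Returns the parsed date, or an empty Optional if no format matches.
    static Optional<LocalDate> parse(String text) {
        if (text == null) {
            return Optional.empty();
        }
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException e) {
                // Not this format; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

Returning the parsed LocalDate rather than a message lets later cleaning steps store every accepted date in one uniform representation.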
Validating e-mail addresses
It is also common to need to validate e-mail addresses. While most e-mail addresses have the @ symbol and require at least one period after the symbol, there are many variations. Consider that each of the following examples can be valid e-mail addresses:
myemail@mail.com
MyEmail@some.mail.com
My.Email.123!@mail.net
One option is to use regular expressions to attempt to capture all allowable e-mail addresses. Notice that the regular expression used in the method that follows is very long and complex. This can make it easy to make mistakes, miss valid e-mail addresses, or accept invalid addresses as valid. But a carefully crafted regular expression can be a very powerful tool.
We use the Pattern and Matcher classes to compile and execute our regular expression. If the pattern of the e-mail we pass in matches the regular expression we defined, we will consider that text to be a valid e-mail address:
public static String validateEmail(String email) {
    // Local part, then either a bracketed IPv4 literal or a dotted domain name
    String emailRegex = "^[a-zA-Z0-9.!$'*+/=?^_`{|}~-"
        + "]+@((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\."
        + "[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$";
    Pattern pattern = Pattern.compile(emailRegex);
    Matcher matcher = pattern.matcher(email);
    if(matcher.matches()){
        return email + " is a valid email address";
    }else{
        return email + " is not a valid email address";
    }
}
We make the following method calls to test our data:
out.println(validateEmail("myemail@mail.com"));
out.println(validateEmail("My.Email.123!@mail.net"));
out.println(validateEmail("myEmail"));
The output follows:
myemail@mail.com is a valid email address
My.Email.123!@mail.net is a valid email address
myEmail is not a valid email address
The JavaMail API provides a way to validate e-mail addresses as well. In this example, we use the InternetAddress class from the javax.mail.internet package to validate whether a given string is a valid e-mail address or not:
public static String validateEmailStandard(String email){
    try{
        InternetAddress testEmail = new InternetAddress(email);
        // validate throws AddressException if the address is malformed
        testEmail.validate();
        return email + " is a valid email address";
    }catch(AddressException e){
        return email + " is not a valid email address";
    }
}
When tested against the same data as in the previous example, our output is identical. However, consider the following method call:
out.println(validateEmailStandard("myEmail@mail"));
Despite not being in standard e-mail format, the output is as follows:
myEmail@mail is a valid email address
Additionally, the validate method by default accepts local e-mail addresses as valid. This is not always desirable, depending upon the purpose of the data.
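When addresses without a dotted domain should be rejected, one option is to add a small structural check alongside whichever validator is used. The sketch below relies only on String methods; the class DomainCheck and the method hasDottedDomain are hypothetical names, and the rule it applies (a period inside the domain, not at either end) is an assumption about what is wanted, not a full statement of the e-mail standards:

```java
// Hypothetical helper class; the names here are illustrative only.
final class DomainCheck {
    private DomainCheck() { }

    // True when the text after the last '@' contains a period that is
    // neither the first nor the last character of the domain.
    static boolean hasDottedDomain(String email) {
        if (email == null) {
            return false;
        }
        int at = email.lastIndexOf('@');
        if (at < 1 || at == email.length() - 1) {
            return false; // no '@', empty local part, or empty domain
        }
        String domain = email.substring(at + 1);
        int dot = domain.indexOf('.');
        return dot > 0 && !domain.endsWith(".");
    }
}
```

An address such as myEmail@mail fails this check, so a caller can require both this check and the validator's result before accepting an address.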
One last option we will look at is the Apache Commons EmailValidator class. This class's isValid method examines an e-mail address and determines whether it is valid or not. Our validateEmail method shown previously is modified as follows to use EmailValidator:
public static String validateEmailApache(String email){
    // Remove leading and trailing whitespace before validating
    email = email.trim();
    EmailValidator eValidator = EmailValidator.getInstance();
    if(eValidator.isValid(email)){
        return email + " is a valid email address.";
    }else{
        return email + " is not a valid email address.";
    }
}
Cleaning images
While image processing is a complex task, we will introduce a few techniques to clean and extract information from an image. This will provide the reader with some insight into image processing. We will also demonstrate how to extract text data from an image using Optical Character Recognition (OCR).
There are several techniques used to improve the quality of an image. Many of these require tweaking of parameters to get the improvement desired. We will demonstrate how to:
- Enhance an image's contrast
- Smooth an image
- Brighten an image
- Resize an image
- Convert images to different formats
We will use OpenCV (http://opencv.org/), an open source project for image processing. There are several classes that we will use:
- Mat: This represents an n-dimensional array holding image data such as channel, grayscale, or color values
- Imgproc: Possesses many methods that process an image
- Imgcodecs: Possesses methods to read and write image files
The OpenCV Javadocs are found at http://docs.opencv.org/java/2.4.9/. In the examples that follow, we will use Wikipedia images since they can be freely downloaded. Specifically, we will use the following images:
Changing the contrast of an image
Here we will demonstrate how to enhance a black-and-white image of a parrot. The Imgcodecs class's imread method reads in the image. Its second parameter specifies the type of color used by the image, which is grayscale in this case. A new Mat object is created for the enhanced image using the same size and color type as the original.
The actual work is performed by the equalizeHist method. This equalizes the histogram of the image, which has the effect of normalizing the brightness and increasing the contrast of the image. An image histogram represents the tonal distribution of an image, that is, how many pixels fall at each brightness level. Tone is also referred to as lightness.
The last step is to write out the image.
// Read the image as grayscale
Mat source = Imgcodecs.imread("GrayScaleParrot.png",
    Imgcodecs.CV_LOAD_IMAGE_GRAYSCALE);
Mat destination = new Mat(source.rows(), source.cols(), source.type());
// Equalize the histogram to increase contrast
Imgproc.equalizeHist(source, destination);
Imgcodecs.imwrite("enhancedParrot.jpg", destination);
The following is the original image:
The enhanced image follows:
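To make the idea of histogram equalization more concrete, the following sketch applies the common textbook formulation to a plain array of grayscale values. It does not use OpenCV, and the internals of equalizeHist may differ in detail. The class name HistogramEqualizer is hypothetical, and the input is assumed to hold values from 0 to 255:

```java
// Hypothetical helper class; a textbook formulation, not OpenCV's own code.
final class HistogramEqualizer {
    private HistogramEqualizer() { }

    // pixels: grayscale values, each assumed to be in the range 0..255.
    static int[] equalize(int[] pixels) {
        // Count how many pixels fall at each brightness level.
        int[] histogram = new int[256];
        for (int p : pixels) {
            histogram[p]++;
        }
        // Cumulative distribution: pixels at or below each level.
        int[] cdf = new int[256];
        int running = 0;
        for (int i = 0; i < 256; i++) {
            running += histogram[i];
            cdf[i] = running;
        }
        // Cumulative count at the darkest level actually present.
        int cdfMin = 0;
        for (int i = 0; i < 256; i++) {
            if (cdf[i] > 0) {
                cdfMin = cdf[i];
                break;
            }
        }
        int total = pixels.length;
        if (total == cdfMin) {
            return pixels.clone(); // empty or single-level image: nothing to stretch
        }
        // Spread the levels across the full 0..255 range.
        int[] result = new int[total];
        for (int i = 0; i < total; i++) {
            result[i] = (int) Math.round(
                (cdf[pixels[i]] - cdfMin) * 255.0 / (total - cdfMin));
        }
        return result;
    }
}
```

For example, the closely spaced values 100, 110, 120, and 130 are mapped to 0, 85, 170, and 255, which illustrates how a narrow band of tones is stretched over the whole range.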
Smoothing an image
Smoothing an image, also called blurring, will make the edges of an image smoother. Blurring is the process of making an image less distinct. We recognize blurred objects when we take a picture with the camera out of focus. Blurring can be used for special effects. Here, we will use it to create an image that we will then sharpen.
The following example loads an image of a cat and repeatedly applies the blur method to the image. In this example, the process is repeated 25 times. Increasing the number of iterations will result in more blur or smoothing.
The third argument of the blur method is the blurring kernel size. The kernel is a matrix, 3 by 3 in this example, that is used for convolution. This is the process of replacing each element of an image with a weighted sum of itself and its neighbors. This allows neighboring values to affect an element's value:
Mat source = Imgcodecs.imread("cat.jpg");
Mat destination = source.clone();
// Apply a 3 by 3 blur 25 times, feeding each result into the next pass
for (int i = 0; i < 25; i++) {
    Mat sourceImage = destination.clone();
    Imgproc.blur(sourceImage, destination, new Size(3.0, 3.0));
}
Imgcodecs.imwrite("smoothCat.jpg", destination);
The following is the original image:
The enhanced image follows:
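The convolution behind a 3 by 3 blur can be sketched on a plain two-dimensional array. This is an illustration only and does not use OpenCV. The class name BoxBlur is hypothetical, and border pixels are handled here by averaging only the neighbors that lie inside the image, which is a simplifying assumption and may not match how OpenCV treats borders:

```java
// Hypothetical helper class; a plain-Java illustration of a 3x3 mean filter.
final class BoxBlur {
    private BoxBlur() { }

    // One pass of a 3x3 mean filter over a rectangular image.
    static double[][] blur3x3(double[][] image) {
        int rows = image.length;
        int cols = rows == 0 ? 0 : image[0].length;
        double[][] result = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double sum = 0.0;
                int count = 0;
                // Visit the pixel and its neighbors, skipping positions outside the image.
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int rr = r + dr;
                        int cc = c + dc;
                        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                            sum += image[rr][cc];
                            count++;
                        }
                    }
                }
                // Equal weights: each output value is the mean of the window.
                result[r][c] = sum / count;
            }
        }
        return result;
    }
}
```

Calling blur3x3 repeatedly on its own output mirrors the loop in the OpenCV example: each pass spreads a pixel's value further into its surroundings.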
Brightening an image
The convertTo method provides a means of brightening an image. The original image is copied to a new image where the contrast and brightness are adjusted. The first parameter is the destination image. The second, with a value of -1, specifies that the type of image should not be changed. The third and fourth parameters control the contrast and brightness respectively. Each pixel value is multiplied by the third argument, and the fourth argument is then added to the product:
Mat source = Imgcodecs.imread("cat.jpg");
Mat destination = new Mat(source.rows(), source.cols(), source.type());
// Keep the type (-1), leave contrast unchanged (1), add 50 to brightness
source.convertTo(destination, -1, 1, 50);
Imgcodecs.imwrite("brighterCat.jpg", destination);
The enhanced image follows:
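The arithmetic applied to each pixel can be sketched without OpenCV as follows. The class name Brightness and the parameter names alpha and beta are assumptions chosen for this illustration, and results are clamped to the 0 to 255 range that an 8-bit image can hold:

```java
// Hypothetical helper class; a plain-Java illustration of scale-and-shift.
final class Brightness {
    private Brightness() { }

    // Computes alpha * pixel + beta for each value, clamped to 0..255.
    static int[] adjust(int[] pixels, double alpha, double beta) {
        int[] result = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            long value = Math.round(alpha * pixels[i] + beta);
            result[i] = (int) Math.max(0, Math.min(255, value));
        }
        return result;
    }
}
```

With alpha set to 1 and beta set to 50, as in the example above, a pixel value of 100 becomes 150, while a value of 230 is capped at 255.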
Resizing an image
Sometimes it is desirable to resize an image. The resize method shown next illustrates how this is done. The image is read in and a new Mat object is created. The resize method is then applied where the width and height are specified in the Size object parameter. The resized image is then saved:
Mat source = Imgcodecs.imread("cat.jpg");
Mat resizeimage = new Mat();
// Resize to 250 by 250 pixels
Imgproc.resize(source, resizeimage, new Size(250, 250));
Imgcodecs.imwrite("resizedCat.jpg", resizeimage);
The enhanced image follows:
Converting images to different formats
Another common operation is to convert an image that uses one format into an image that uses a different format. In OpenCV, this is easy to accomplish as shown next. The image is read in and then immediately written out. The extension of the file is used by the imwrite method to convert the image to the new format:
Mat source = Imgcodecs.imread("cat.jpg");
// The file extension determines the output format
Imgcodecs.imwrite("convertedCat.jpg", source);
Imgcodecs.imwrite("convertedCat.jpeg", source);
Imgcodecs.imwrite("convertedCat.webp", source);
Imgcodecs.imwrite("convertedCat.png", source);
Imgcodecs.imwrite("convertedCat.tiff", source);
The images can now be used for specialized processing if necessary.
Changing the contrast of an image
Here we will demonstrate how to enhance a black-and-white image of a parrot. The Imgcodecs class's imread method reads in the image. Its second parameter specifies the type of color used by the image, which is grayscale in this case. A new Mat object is created for the enhanced image using the same size and color type as the original.
The actual work is performed by the equalizeHist method. This equalizes the histogram of the image, which has the effect of normalizing the brightness and increasing the contrast of the image. An image histogram is a histogram representing the tonal distribution of an image. Tone is also referred to as lightness. It represents the variation in the brightness found in an image.
The last step is to write out the image:
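// Read the image as grayscale; IMREAD_GRAYSCALE is the current name
// for the older CV_LOAD_IMAGE_GRAYSCALE constant
Mat source = Imgcodecs.imread("GrayScaleParrot.png", Imgcodecs.IMREAD_GRAYSCALE);
// Allocate the output with the same size and type as the input
Mat destination = new Mat(source.rows(), source.cols(), source.type());
Imgproc.equalizeHist(source, destination);
Imgcodecs.imwrite("enhancedParrot.jpg", destination);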
The following is the original image:
The enhanced image follows:
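To make the idea of histogram equalization more concrete, the following is a minimal sketch in plain Java that does not use OpenCV. It follows the common textbook formulation based on the cumulative histogram and is not a copy of the equalizeHist implementation. The class name HistogramEqualizationSketch and the equalize method are hypothetical names chosen for this illustration, and the input is assumed to be an array of 8-bit grayscale values:

```java
import java.util.Arrays;

class HistogramEqualizationSketch {

    // Equalizes grayscale values, assumed to lie in the range 0-255,
    // by mapping each value through the cumulative histogram.
    static int[] equalize(int[] pixels) {
        // Count how often each brightness level occurs
        int[] histogram = new int[256];
        for (int p : pixels) {
            histogram[p]++;
        }

        // Build the cumulative histogram
        int[] cdf = new int[256];
        int running = 0;
        for (int v = 0; v < 256; v++) {
            running += histogram[v];
            cdf[v] = running;
        }

        // Find the cumulative count of the darkest level that occurs
        int cdfMin = 0;
        for (int v = 0; v < 256; v++) {
            if (cdf[v] > 0) {
                cdfMin = cdf[v];
                break;
            }
        }

        int total = pixels.length;
        if (total == cdfMin) {
            // Every pixel has the same value, so there is nothing to spread out
            return pixels.clone();
        }

        // Stretch the cumulative counts across the full 0-255 range
        int[] result = new int[total];
        for (int i = 0; i < total; i++) {
            double scaled = (cdf[pixels[i]] - cdfMin) * 255.0 / (total - cdfMin);
            result[i] = (int) Math.round(scaled);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] lowContrast = {100, 100, 110, 110, 120, 120, 130, 130};
        // Prints [0, 0, 85, 85, 170, 170, 255, 255]
        System.out.println(Arrays.toString(equalize(lowContrast)));
    }
}
```

The input values span only 100 to 130, while the output spans the full range from 0 to 255. This spreading of the brightness levels is what increases the contrast.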
Smoothing an image
Smoothing an image, also called blurring, will make the edges of an image smoother. Blurring is the process of making an image less distinct. We recognize blurred objects when we take a picture with the camera out of focus. Blurring can be used for special effects. Here, we will use it to create an image that we will then sharpen.
The following example loads an image of a cat and repeatedly applies the blur method to the image. In this example, the process is repeated 25 times. Increasing the number of iterations will result in more blur or smoothing.
The third argument of the blur method is the size of the blurring kernel. The kernel is a matrix, 3 by 3 in this example, that is used for convolution. In this process, each pixel is replaced by a weighted sum of its own value and the values of its neighbors. This allows neighboring values to affect an element's value:
Mat source = Imgcodecs.imread("cat.jpg");
Mat destination = source.clone();
// Apply a 3 by 3 blur 25 times; each pass blurs the previous result
for (int i = 0; i < 25; i++) {
    Mat sourceImage = destination.clone();
    Imgproc.blur(sourceImage, destination, new Size(3.0, 3.0));
}
Imgcodecs.imwrite("smoothCat.jpg", destination);
The following is the original image:
The enhanced image follows:
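The convolution performed by a 3 by 3 blurring kernel can be sketched in plain Java on a small two-dimensional array. This is only an illustration under simplifying assumptions: the image is a non-empty rectangular array of grayscale values, every weight in the kernel is equal, and pixels on the border average only those neighbors that lie inside the image, which is not how OpenCV treats borders. The names BoxBlurSketch, blur3x3, and blurRepeatedly are hypothetical:

```java
class BoxBlurSketch {

    // Replaces each pixel with the mean of the 3 by 3 window centered on it.
    // Border pixels average only the neighbors that fall inside the image.
    static int[][] blur3x3(int[][] image) {
        int rows = image.length;
        int cols = image[0].length;
        int[][] result = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int sum = 0;
                int count = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr;
                        int nc = c + dc;
                        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols) {
                            sum += image[nr][nc];
                            count++;
                        }
                    }
                }
                result[r][c] = Math.round((float) sum / count);
            }
        }
        return result;
    }

    // Applies the blur several times, feeding each result into the next pass
    static int[][] blurRepeatedly(int[][] image, int iterations) {
        int[][] current = image;
        for (int i = 0; i < iterations; i++) {
            current = blur3x3(current);
        }
        return current;
    }

    public static void main(String[] args) {
        int[][] image = {
            {0, 0, 0},
            {0, 90, 0},
            {0, 0, 0}
        };
        int[][] blurred = blur3x3(image);
        // The single bright pixel is spread over its neighborhood
        System.out.println(blurred[1][1]); // prints 10
    }
}
```

A single bright pixel is averaged with its eight dark neighbors, so its value drops while the surrounding values rise. Repeating the pass spreads the value further, which is why more iterations produce more smoothing.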
Brightening an image
The convertTo method provides a means of brightening an image. The original image is copied to a new image where the contrast and brightness are adjusted. The first parameter is the destination image. The second, -1, specifies that the type of the image should not be changed. The third and fourth parameters control the contrast and brightness respectively. Each pixel value is multiplied by the third parameter, and the fourth parameter is then added to the product. In this example, the contrast factor of 1 leaves the contrast unchanged while 50 is added to every pixel value:
Mat source = Imgcodecs.imread("cat.jpg");
Mat destination = new Mat(source.rows(), source.cols(), source.type());
// Keep the image type (-1), leave the contrast unchanged (1), and add 50 to each value
source.convertTo(destination, -1, 1, 50);
Imgcodecs.imwrite("brighterCat.jpg", destination);
The enhanced image follows:
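The arithmetic applied to each pixel can be sketched in plain Java. The sketch assumes 8-bit values, so results are limited to the range 0 to 255, and it uses ordinary rounding, which may differ slightly from the rounding OpenCV performs. The names BrightnessSketch, adjust, and adjustAll are hypothetical and are not part of OpenCV:

```java
class BrightnessSketch {

    // Computes alpha * pixel + beta and limits the result to the 8-bit range.
    // alpha acts as the contrast factor and beta as the brightness offset.
    static int adjust(int pixel, double alpha, double beta) {
        long scaled = Math.round(alpha * pixel + beta);
        return (int) Math.max(0, Math.min(255, scaled));
    }

    // Applies the same adjustment to every value in the array
    static int[] adjustAll(int[] pixels, double alpha, double beta) {
        int[] result = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            result[i] = adjust(pixels[i], alpha, beta);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same factors as the convertTo call: contrast 1, brightness 50
        System.out.println(adjust(100, 1, 50)); // prints 150
        System.out.println(adjust(230, 1, 50)); // prints 255, the upper limit
    }
}
```

Values that are already bright reach the upper limit and lose detail, which is why a large brightness offset can wash out the lightest parts of an image.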
Resizing an image
Sometimes it is desirable to resize an image. The resize method shown next illustrates how this is done. The image is read in and a new Mat object is created. The resize method is then applied where the width and height are specified in the Size object parameter. The resized image is then saved:
Mat source = Imgcodecs.imread("cat.jpg");
Mat resizeimage = new Mat();
// The Size object holds the new width and height in pixels
Imgproc.resize(source, resizeimage, new Size(250, 250));
Imgcodecs.imwrite("resizedCat.jpg", resizeimage);
The enhanced image follows:
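A fixed size such as 250 by 250 will stretch or squash an image that is not square. If the proportions of the original should be kept, the target width and height can be computed before they are passed to the Size object. The following plain Java sketch shows one way to do this. The class ResizeDimensionsSketch and its fitWithin method are hypothetical helpers, not part of OpenCV, and the sketch assumes that all arguments are positive:

```java
class ResizeDimensionsSketch {

    // Returns {width, height} scaled to fit inside maxWidth by maxHeight
    // while keeping the original aspect ratio.
    static int[] fitWithin(int width, int height, int maxWidth, int maxHeight) {
        // Use the smaller scale factor so that both dimensions fit
        double scale = Math.min((double) maxWidth / width, (double) maxHeight / height);
        int newWidth = Math.max(1, (int) Math.round(width * scale));
        int newHeight = Math.max(1, (int) Math.round(height * scale));
        return new int[] {newWidth, newHeight};
    }

    public static void main(String[] args) {
        int[] size = fitWithin(1000, 500, 250, 250);
        System.out.println(size[0] + " x " + size[1]); // prints 250 x 125
    }
}
```

The two values returned could then be used as the width and height of the Size object in place of the fixed 250 by 250.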
Converting images to different formats
Another common operation is to convert an image that uses one format into an image that uses a different format. In OpenCV, this is easy to accomplish as shown next. The image is read in and then immediately written out. The extension of the file is used by the imwrite method to convert the image to the new format:
Mat source = Imgcodecs.imread("cat.jpg");
// The file extension determines the output format
Imgcodecs.imwrite("convertedCat.jpg", source);
Imgcodecs.imwrite("convertedCat.jpeg", source);
Imgcodecs.imwrite("convertedCat.webp", source);
Imgcodecs.imwrite("convertedCat.png", source);
Imgcodecs.imwrite("convertedCat.tiff", source);
The images can now be used for specialized processing if necessary.
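The same idea of choosing the output format from the file extension can be sketched with the ImageIO class that ships with the JDK. This is an alternative to OpenCV, not a description of how imwrite works internally. The class FormatConversionSketch and its methods are hypothetical, the image is encoded in memory rather than written to disk, and the set of supported formats depends on the Java runtime:

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Locale;
import javax.imageio.ImageIO;

class FormatConversionSketch {

    // Returns the lowercase extension of a file name, or "" if there is none
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Encodes the image in the format implied by the file name's extension
    static byte[] encode(BufferedImage image, String fileName) {
        // Avoid a temporary disk cache for this in-memory example
        ImageIO.setUseCache(false);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // ImageIO.write returns false when no writer handles the format
            if (!ImageIO.write(image, extensionOf(fileName), out)) {
                throw new IllegalArgumentException("No writer for " + fileName);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        byte[] png = encode(image, "convertedCat.png");
        byte[] jpg = encode(image, "convertedCat.jpg");
        System.out.println(png.length + " bytes as PNG, " + jpg.length + " bytes as JPEG");
    }
}
```

Formats for which the runtime has no writer cause the encode method to throw an exception, so it is worth checking which formats are available before relying on a particular extension.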
Summary
Many times, half the battle in data science is manipulating data so that it is clean enough to work with. In this chapter, we examined many techniques for taking real-world, messy data and transforming it into workable datasets. This process is generally known as data cleaning, wrangling, reshaping, or munging. Our focus was on core Java techniques, but we also examined third-party libraries.
Before we can clean data, we need to have a solid understanding of the format of our data. We discussed CSV data, spreadsheets, PDF, and JSON file types, as well as provided several examples of manipulating text file data. As we examined text data, we looked at multiple approaches for processing the data, including tokenizers, Scanners, and BufferedReaders. We showed ways to perform simple cleaning operations, remove stop words, and perform find and replace functions.
This chapter also included a discussion on data imputation and the importance of identifying and rectifying missing data situations. Missing data can cause problems during data analysis and we proposed different methods for dealing with this problem. We demonstrated how to retrieve subsets of data and sort data as well.
Finally, we discussed image cleaning and demonstrated several methods of modifying image data. This included changing contrast, smoothing, brightening, and resizing images. We concluded with a discussion on extracting text superimposed on an image.
With this background, we will introduce basic statistical methods and their Java support in the next chapter.