You're reading from Mastering Text Mining with R Extract and recognize your text data

Product type Paperback

Published in Dec 2016

Publisher Packt

ISBN-13 9781783551811

Length 258 pages

Edition 1st Edition

Concepts

Data Mining

Author (1):

KUMAR ASHISH

Quantitative methods in linguistics

Text can be grammatically complex, and it is difficult to account for all of that complexity when analyzing it. To get meaning out of a text or a document, we need a measure: we extract quantitative data by processing the text with various transformation methods. Each method discards unnecessary, ancillary information. There are various methods, packages, APIs, and software that can transform text into quantitative data, but before using any of them, we need to analyze and test our documents with different approaches.

As a first step, we assume that a document is a collection of words in which order doesn't influence our analysis (the bag-of-words assumption). We consider unigrams; for some analyses, bigrams and trigrams can also be used to provide more meaningful results. Next, we simplify the vocabulary by passing the document through a stemming process, which reduces words to their root. A more advanced approach would be lemmatization. Then we discard punctuation, capitalization, stop words, and very common words. The resulting text is what we use for quantitative analysis. Let me list a few quantitative methods and explain why they are used.
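The preprocessing steps described here can be sketched in base R. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the helper names preprocess and ngrams are made up for this example, and stemming is left out (a package such as SnowballC could be used for that step).

```r
# Minimal bag-of-words preprocessing using base R only.
# ASSUMPTION: this stop-word list is a small hypothetical sample.
stop_words <- c("in", "are", "i", "all", "the", "a")

preprocess <- function(text, stop_words) {
  text <- tolower(text)                          # discard capitalization
  text <- gsub("[[:punct:]]", " ", text)         # discard punctuation
  tokens <- strsplit(text, "[[:space:]]+")[[1]]  # split into unigrams
  tokens <- tokens[nzchar(tokens)]               # drop empty strings
  tokens[!tokens %in% stop_words]                # discard stop words
}

# Build n-grams by pasting together n consecutive tokens.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

unigrams <- preprocess("Ice creams in summer are awesome!", stop_words)
bigrams  <- ngrams(unigrams, 2)
print(unigrams)
print(bigrams)
```

Note that the bigrams are built after stop-word removal, so a pair such as "creams summer" joins words that were not adjacent in the original sentence; whether that is acceptable depends on the analysis.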

Document term matrix

In order to find the similarity between documents in a corpus, we can use a document term matrix. In a document term matrix, rows represent documents, columns represent terms, and each cell value is the term frequency count for a document. It is one of the useful ways of modeling documents. Here is an example:

Document-1: Ice creams in summer are awesome
Document-2: I love ice creams in summer
Document-3: Ice creams are awesome all season

	icecream	summer	love	awesome	season
Doc1	1	1	0	1	0
Doc2	1	1	1	0	0
Doc3	1	0	0	1	1

If we visualize this in a term-document space, each document becomes a point in it. We can then tell how similar two documents are by calculating the distance between the two points using Euclidean distance.
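As a small sketch of this idea, the document term matrix from the example can be entered by hand and passed to the base R dist function, which computes pairwise distances between rows. The variable names dtm and doc_dist are chosen for this illustration only.

```r
# Document term matrix from the example, entered by hand.
dtm <- matrix(
  c(1, 1, 0, 1, 0,
    1, 1, 1, 0, 0,
    1, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Doc1", "Doc2", "Doc3"),
                  c("icecream", "summer", "love", "awesome", "season"))
)

# dist() treats each row (document) as a point and returns
# the pairwise Euclidean distances between them.
doc_dist <- dist(dtm, method = "euclidean")
print(doc_dist)
```

Doc1 differs from each of the other two documents in two terms, while Doc2 and Doc3 differ in four, so Doc2 and Doc3 end up farthest apart.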

When a term occurs in a lot of documents, it tends to discriminate between documents notably less than terms that occur only a few times. For example, India Today has more to do with India than with today. Such frequently occurring terms can affect the similarity comparison, because the term space will be biased towards them. To address this problem, we use inverse document frequency.

Inverse document frequency

A commonly used measure of a term's selective potential is its inverse document frequency (IDF). The formula for IDF is as follows:

$$\mathrm{IDF}(term) = \log\left(\frac{N}{df(term)}\right)$$

Here, N is the number of documents in the corpus and df(term) is the number of documents in which the term appears.

The weight of a term's appearance in a document is calculated by combining the term frequency (TF) in the document with its IDF:

$$w(term, document) = \mathrm{TF}(term, document) \times \mathrm{IDF}(term)$$

This term–document score is known as TF*IDF and is widely used, for example by search platforms and APIs such as Solr, Elasticsearch, and Lucene. TF*IDF scores are pre-computed and stored, so that a similarity comparison can be done with just a dot product.
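To make the weighting concrete, here is a minimal sketch that computes TF*IDF for the ice cream example in base R. It assumes the natural logarithm for IDF and raw counts for TF; real search engines use their own variants. The names doc_freq, idf, tfidf, and sim are illustrative.

```r
# Document term matrix from the ice cream example (raw term counts = TF).
dtm <- matrix(
  c(1, 1, 0, 1, 0,
    1, 1, 1, 0, 0,
    1, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Doc1", "Doc2", "Doc3"),
                  c("icecream", "summer", "love", "awesome", "season"))
)

n_docs   <- nrow(dtm)           # N: number of documents in the corpus
doc_freq <- colSums(dtm > 0)    # df(term): documents containing the term
idf      <- log(n_docs / doc_freq)   # ASSUMPTION: natural logarithm

# Multiply every column (term) of the TF matrix by that term's IDF.
tfidf <- sweep(dtm, 2, idf, `*`)
print(round(tfidf, 3))

# Pairwise document similarity as dot products of the TF*IDF rows.
sim <- tcrossprod(tfidf)
print(round(sim, 3))
```

Because icecream appears in every document, its IDF is log(1) = 0 and it no longer contributes to the similarity, which is exactly the correction for very common terms described above.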

When we look at the entries in this term–document matrix, most of the cells will be empty (zero) because only a few terms appear in each document. Storing all the empty cells requires a lot of memory and contributes nothing to the dot product (the similarity computation). Various sparse matrix representations are possible, and these are used for optimized query processing.
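One simple sparse representation stores only the non-zero cells as (document, term, value) triplets. The sketch below builds such a list in base R for the ice cream example; the name triplets is illustrative, and in practice a dedicated package such as Matrix would normally be used instead of a hand-rolled data frame.

```r
# Dense document term matrix from the ice cream example.
dtm <- matrix(
  c(1, 1, 0, 1, 0,
    1, 1, 1, 0, 0,
    1, 0, 0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Doc1", "Doc2", "Doc3"),
                  c("icecream", "summer", "love", "awesome", "season"))
)

# Row and column positions of the non-zero cells.
nz <- which(dtm != 0, arr.ind = TRUE)

# Triplet representation: one row per non-zero cell.
triplets <- data.frame(
  doc   = rownames(dtm)[nz[, "row"]],
  term  = colnames(dtm)[nz[, "col"]],
  value = dtm[nz],
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(triplets)

# Fraction of cells that actually need to be stored.
nrow(triplets) / length(dtm)
```

With only three short documents the saving is modest, but for a realistic corpus with thousands of terms the non-zero fraction is typically very small.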

Words similarity and edit-distance functions

To support fuzzy searches, we need to quantify the similarity between words; some of the quantitative methods used are explained below. Before going into them, let's install an R package, stringdist, which can be used to apply the various algorithms described below to calculate string similarity:

install.packages("stringdist")
library(stringdist)

One way of finding the similarity between two words is by edit distance. Edit distance refers to the number of operations required to transform one string into another.

Euclidean distance

Euclidean distance is the distance between two points in the term-document space; it can be calculated by using the formula for a two-dimensional space as follows:

e <- sqrt((x1 - x2)^2 + (y1 - y2)^2)

Here, (x1, y1) and (x2, y2) are the two points and e is the Euclidean distance between them.

We can easily generalize this formula to any number of dimensions in R code, where x1 and x2 are now numeric vectors holding the coordinates of the two points:

euclidean.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))
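As a quick check, the function can be applied to a pair of two-dimensional points and to two rows of the ice cream document term matrix. The vectors doc1 and doc2 below are copied by hand from that table.

```r
# Same definition as in the text, repeated so the example is self-contained.
euclidean.dist <- function(x1, x2) sqrt(sum((x1 - x2) ^ 2))

# Two points in a plane: a 3-4-5 right triangle.
d_points <- euclidean.dist(c(0, 0), c(3, 4))
print(d_points)

# Doc1 and Doc2 as term-count vectors
# (icecream, summer, love, awesome, season).
doc1 <- c(1, 1, 0, 1, 0)
doc2 <- c(1, 1, 1, 0, 0)
d_docs <- euclidean.dist(doc1, doc2)
print(d_docs)
```

The two documents differ in exactly two terms (love and awesome), so their distance is the square root of 2.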

Cosine similarity

Euclidean distance has its pitfalls. Documents with lots of terms lie far from the origin, while short documents lie close to it, so short documents can appear relatively similar even when they are unrelated, simply because the distance between them is small.

To avoid such length issues, we can measure similarity by the angle between the document vectors; in practice, we measure the cosine of the angle. The larger the cosine value, the more similar the documents are. Since we use the cos function, this is called cosine similarity.

The formula to calculate the cosine of the angle between two vectors A and B is as follows:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$

This kind of geometric modeling is also called the vector space model. The following code computes the cosine similarity between the rows of two matrices:

# Create two random matrices, matrixA and matrixB
n_col <- 5
n_row <- 5
matrixA <- matrix(runif(n_col * n_row), ncol = n_col)
matrixB <- matrix(runif(n_col * n_row), ncol = n_col)

# Cosine similarity between every row of matrixA and every row of matrixB
cosine_sim <- function(matrixA, matrixB) {
  m  <- tcrossprod(matrixA, matrixB)  # dot products of all row pairs
  c1 <- sqrt(rowSums(matrixA^2))      # Euclidean norm of each row of matrixA
  c2 <- sqrt(rowSums(matrixB^2))      # Euclidean norm of each row of matrixB
  m / outer(c1, c2)                   # divide by the products of the norms
}

# Estimate the cosine similarity between the two matrices created earlier
cosine_sim(matrixA, matrixB)

Alternatively, cosine similarity can also be computed with functions available in packages such as lsa and proxy.
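The difference between the two measures shows up when one document is simply a longer version of another. In the hypothetical sketch below, long_doc repeats the content of short_doc three times: the Euclidean distance between them is large, yet the cosine similarity is 1 because the vectors point in the same direction. The helper cosine_vec is written for this illustration only.

```r
# Cosine similarity between two numeric vectors.
cosine_vec <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Term counts for a short document and the same text repeated three times.
short_doc <- c(1, 1, 0, 1, 0)
long_doc  <- 3 * short_doc

euclid <- sqrt(sum((short_doc - long_doc)^2))
cosine <- cosine_vec(short_doc, long_doc)

print(euclid)   # clearly greater than zero
print(cosine)   # equal to 1 up to floating-point rounding
```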

Levenshtein distance

The Levenshtein distance between two words, x and y, is the minimal number of insertions, deletions, and replacements needed for transforming word x into word y.

If we want to convert abcd to abdc, we need to replace c with d and replace d with c, so the distance is 2:

library(stringdist)
stringdist('abcd', 'abdc', method='lv')
[1] 2
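To show what such a function computes, here is a minimal dynamic-programming sketch of the Levenshtein distance in base R. It is an illustration, not the stringdist implementation; the function name levenshtein is made up for this example. Base R also ships utils::adist, which can be used as a cross-check.

```r
# Levenshtein distance via the classic dynamic-programming table.
levenshtein <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  # d[i + 1, j + 1] is the distance between the first i characters of x
  # and the first j characters of y.
  d <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  d[, 1] <- 0:length(a)   # delete all characters of the prefix of x
  d[1, ] <- 0:length(b)   # insert all characters of the prefix of y
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + cost)    # replacement (or match)
    }
  }
  d[length(a) + 1, length(b) + 1]
}

lv_example <- levenshtein("abcd", "abdc")
print(lv_example)

# Cross-check against base R's generalized Levenshtein distance.
print(drop(utils::adist("abcd", "abdc")))
```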

Damerau-Levenshtein distance

The Damerau-Levenshtein distance is the minimal number of insertions, deletions, replacements, and adjacent transpositions needed for transforming word x into word y.

If we want to convert abcd to abdc, we only need to swap the adjacent characters c and d, so the distance is 1:

stringdist('abcd', 'abdc', method='dl')
[1] 1
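The effect of allowing transpositions can be sketched by adding one extra case to the Levenshtein recurrence. Note that the code below implements the restricted variant, usually called optimal string alignment (OSA), in which each substring may be edited only once; it is not the full Damerau-Levenshtein algorithm, and the function name osa_distance is made up for this example. The two variants agree on abcd and abdc but can differ on other inputs.

```r
# Optimal string alignment (restricted Damerau-Levenshtein) distance.
osa_distance <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  d <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  d[, 1] <- 0:length(a)
  d[1, ] <- 0:length(b)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- if (a[i] == b[j]) 0 else 1
      best <- min(d[i, j + 1] + 1,   # deletion
                  d[i + 1, j] + 1,   # insertion
                  d[i, j] + cost)    # replacement (or match)
      # Extra case: the last two characters are swapped.
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        best <- min(best, d[i - 1, j - 1] + 1)   # adjacent transposition
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[length(a) + 1, length(b) + 1]
}

osa_example <- osa_distance("abcd", "abdc")
print(osa_example)
```

A classic case where the variants differ is converting ca to abc: the restricted version needs 3 edits, whereas full Damerau-Levenshtein needs only 2 (swap to ac, then insert b). In stringdist, the restricted variant is available as method='osa'.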

Hamming distance

The Hamming distance between two words is the number of positions at which the characters are different. It is the minimum number of substitutions required to change one word into another. In order to use the Hamming distance, the words must be of the same length.

If we want to convert abcd to abdc, we need to substitute c with d and d with c, so the distance is 2:

stringdist('abcd', 'abdc', method='hamming')
[1] 2
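Because it only compares characters position by position, the Hamming distance is easy to sketch in base R. The function name hamming is illustrative; this version stops with an error when the lengths differ, which is a design choice for this example rather than a description of how stringdist behaves.

```r
# Hamming distance: number of positions at which the characters differ.
hamming <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  if (length(a) != length(b)) {
    stop("Hamming distance requires strings of equal length")
  }
  sum(a != b)
}

hamming_example <- hamming("abcd", "abdc")
print(hamming_example)
```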

Jaro-Winkler distance

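The Jaro-Winkler distance measure is best suited for short strings, as in name comparison or record linkage; it was designed to compare surnames and names. The underlying Jaro-Winkler similarity is higher for more similar strings; stringdist reports the corresponding distance (1 minus the similarity), so the lower the distance, the more similar the strings are.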

In order to measure the Jaro distance, we need to perform the following two tasks:

Compute the number of matches
Compute the number of transpositions

The Winkler adjustment involves a final rescoring that rewards words sharing the same initial characters (a common prefix). It uses a constant scaling factor p:

stringdist('abcd', 'abdc', method = 'jw', p = 0.1)
[1] 0.06666667
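The value above can be reproduced by hand. The arithmetic below is a worked sketch using the standard Jaro and Winkler formulas, with the match and transposition counts for abcd and abdc worked out manually (all four characters match within the allowed window, and the c/d pair counts as one transposition). The variable names are illustrative.

```r
# Hand-counted quantities for 'abcd' vs 'abdc'.
len1      <- 4   # length of the first string
len2      <- 4   # length of the second string
n_match   <- 4   # a, b, c, d all match within the allowed window
n_transp  <- 1   # c and d appear in swapped order: one transposition
prefix    <- 2   # common prefix "ab"
p         <- 0.1 # Winkler scaling factor

# Jaro similarity: average of three ratios.
jaro_sim <- (n_match / len1 +
             n_match / len2 +
             (n_match - n_transp) / n_match) / 3

# Winkler adjustment: boost the similarity for the common prefix.
jw_sim <- jaro_sim + prefix * p * (1 - jaro_sim)

# The corresponding distance is 1 minus the similarity.
jw_dist <- 1 - jw_sim
print(jw_dist)
```

The Jaro similarity is about 0.917, the prefix bonus raises it to about 0.933, and the resulting distance is about 0.067.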

Measuring readability of a text

Readability is the ease with which a text can be read by a reader. The readability of a text depends on its content and the complexity of its vocabulary and syntax. There are a number of methods to measure the readability of a text. Most of them are based on correlation analysis, where researchers have selected a number of text properties (such as words per sentence, average number of syllables per word, and so on) and then asked test subjects to grade the readability of various texts on a scale. By looking at the text properties of these texts, it is possible to correlate how much "words per sentence" influences readability.

Note

The koRpus package in R provides a readability function that computes several readability indices for a given text; it relies on hyphenation (the hyphen function) to count syllables.

Gunning fog index

The Gunning fog index is one of the best-known methods for measuring the level of reading difficulty of a document. The fog index translates into the number of years of education a reader needs in order to understand the given material. The ideal score is 7 or 8; anything above 12 is too hard for most people to read.

The Gunning fog index is calculated as shown in the following steps:

Select all the sentences in a passage of approximately 100 words.
Calculate the average sentence length by dividing the number of words by the number of sentences.
Count all the words with three or more syllables. Generally, words with three or more syllables are considered to be complex.
Add the average sentence length and the percentage of complex words.
Multiply the result by 0.4.
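The formula is as shown here:

$$\text{Fog index} = 0.4 \times \left[ \frac{\text{words}}{\text{sentences}} + 100 \times \frac{\text{complex words}}{\text{words}} \right]$$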
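The steps listed above can be sketched in base R. This is a rough illustration under strong assumptions: syllables are approximated by counting groups of consecutive vowels, which is only a crude heuristic, and sentences are split on ., !, and ?. The function names count_syllables and gunning_fog are made up for this example; for real work, a package such as koRpus is a better choice.

```r
# ASSUMPTION: a syllable is approximated as a run of consecutive vowels.
count_syllables <- function(word) {
  groups <- gregexpr("[aeiouy]+", tolower(word))[[1]]
  if (groups[1] == -1) 0L else length(groups)
}

gunning_fog <- function(text) {
  # Split into sentences on terminal punctuation.
  sentences <- unlist(strsplit(text, "[.!?]+"))
  sentences <- sentences[nzchar(trimws(sentences))]

  # Split into words after replacing non-letters with spaces.
  cleaned <- gsub("[^[:alpha:][:space:]]", " ", text)
  words <- unlist(strsplit(cleaned, "[[:space:]]+"))
  words <- words[nzchar(words)]

  syllables <- vapply(words, count_syllables, integer(1))

  avg_sentence_length <- length(words) / length(sentences)
  pct_complex <- 100 * sum(syllables >= 3) / length(words)

  0.4 * (avg_sentence_length + pct_complex)
}

sample_text <- "The cat sat on the mat. It was a remarkable afternoon."
fog <- gunning_fog(sample_text)
print(fog)
```

For this sample, there are 11 words in 2 sentences, and the heuristic flags remarkable and afternoon as complex, which gives a score of roughly 9.5. A real passage of about 100 words would give a more meaningful estimate.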