Defining DR
It is a commonly accepted rule of thumb that data represented in more than three dimensions is difficult to understand or visualize.
Dimensionality reduction (also called dimensional reduction) is the process of reducing the number of random variables (or data dimensions) under statistical consideration, or, perhaps better put, of finding a lower-dimensional representation of the feature set that is of interest.
This allows the data scientist to:
- Avoid what is referred to as the curse of dimensionality
Note
The curse of dimensionality refers to phenomena that arise when analyzing data in high-dimensional spaces (often with hundreds or thousands of dimensions) and that do not occur in low-dimensional settings or everyday experience.
- Reduce the amount of time and memory required for the proper analysis of the data
- Allow the data to be more easily visualized
- Eliminate features irrelevant to the model's purpose
- Reduce model noise
A useful (albeit perhaps over-used) conceptual example of dimensional reduction is a computer-generated image of a single human face that is in fact made up of thousands of images of individual human faces. If we consider the attributes of each individual face, the data may become overwhelming; however, if we reduce the dimensionality of all those images into several principal components (eyes, nose, lips, and so on), the data becomes somewhat more manageable.
The following sections outline some of the most common methods and strategies for dimension reduction.
Correlated data analyses
It is typical to think of the terms dependence and association as having the same meaning. Both are used to describe a relationship between two (or more) random variables or within bivariate data.
Note
A random variable is a variable quantity whose value depends on possible outcomes; bivariate data is data with two variables that may or may not have an apparent relationship.
Correlated data, or data with correlation, describes a type of (typically linear) statistical relationship. A popular example of a correlation is product pricing, such as when the popularity of a product drives the manufacturer's pricing strategies.
Identifying correlations can be very useful because a correlation may indicate a predictive relationship that can be exploited, or it can be used to reduce dimensionality within a population or data file.
Common examples of correlation and predictive relationships typically involve the weather, but another idea might be found at http://www.nfl.com/. If you are familiar with the National Football League and have visited the site, then you'll know that each NFL team sells team-emblazoned merchandise, and a team that has earned a winning season will most likely have higher product sales that year.
In this example, there is a causal relationship, because a team's winning season causes its fans to purchase more of their team's merchandise. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (there will be more on this later in this chapter).
Scatterplots
As an aside, a scatterplot is often used to graphically represent the relationship between two variables and therefore is a great choice for visualizing correlated data.
Using the R plot function, you can easily generate a pretty nice visual of our winning team example, shown as follows:
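As a minimal sketch of such a plot, the following uses made-up numbers; the vectors games_won and merchandise_sold are hypothetical and are not actual NFL figures:

```r
# Hypothetical data for illustration only (not actual NFL figures).
games_won        <- c(3, 4, 6, 7, 8, 9, 10, 11, 12, 13)
merchandise_sold <- c(120, 150, 140, 210, 230, 260, 250, 310, 330, 360)  # thousands of items

# Scatterplot of the two variables, one point per team.
plot(games_won, merchandise_sold,
     xlab = "Games Won", ylab = "Merchandise Sold",
     main = "Games Won versus Merchandise Sold", pch = 19)

# Optional: overlay a least-squares line to make the trend easier to see.
abline(lm(merchandise_sold ~ games_won), lty = 2)
```

The dashed line is only a visual aid; the scatter of points around it gives a first impression of how strong the relationship is.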
Note
As other visualization options, boxplots and violin plots can also be used.
As illustrated in the preceding graph, correlation describes or measures the extent to which two or more variables fluctuate together (in the preceding example, Games Won and Merchandise Sold).
This measurement can be categorized as positive or negative, with a positive correlation showing the extent to which those variables increase or decrease in parallel, and a negative correlation showing the extent to which one variable increases as the other decreases.
A correlation coefficient is a statistical measure of the degree to which changes in the value of one variable will or can predict change to the value of another variable.
When the variation of one variable reliably predicts a similar variation in another variable, there's often a predisposition to think that this means that the change in one causes the change in the other. However, correlation does not imply causation. [There may be, for example, an unknown factor that influences both variables similarly].
To illustrate, think of a situation where television advertising has suggested that athletes who wear a certain brand of shoe run faster. However, those same athletes each employ personal trainers, which may be an influential factor.
So, correlation (or performing correlation analysis) is a statistical technique that can show whether and how strongly pairs of variables are related. When variables are identified that are strongly related, it makes statistical sense to remove one of them from the analysis; however, when pairs of variables appear related but have a weaker relationship, it's best to have both variables remain in the population.
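One way this idea might be applied is sketched below, using the numeric columns of the built-in iris data file. The cutoff of 0.9 for "strongly related" is an assumption chosen for illustration, not a rule from the text:

```r
# Numeric columns of the built-in iris data file.
x <- iris[, 1:4]

# Absolute pairwise correlations; zero out the diagonal and lower triangle
# so that each pair of variables is considered only once.
r <- abs(cor(x))
r[lower.tri(r, diag = TRUE)] <- 0

# Hypothetical cutoff: treat |r| above 0.9 as "strongly related".
cutoff <- 0.9
strong <- which(r > cutoff, arr.ind = TRUE)

# For each strongly related pair, drop the second variable of the pair.
drop_cols <- unique(colnames(r)[strong[, "col"]])
reduced   <- x[, setdiff(colnames(x), drop_cols), drop = FALSE]

drop_cols        # variables removed
names(reduced)   # variables kept
```

Which member of a pair to drop is a judgment call; here the later column is dropped simply to keep the sketch short.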
For example, winning professional football teams and team merchandise sales are related in that teams with winning seasons frequently sell more merchandise. However, some teams have a stronger following than others and have high merchandise sales even when they lose more games than they win. Nonetheless, the average sales of teams winning more than 50% of their games exceed those of teams losing more than 50% of their games, the sales of teams winning more than 75% of their games exceed those of teams losing more than 75% of their games, and so on. So, what is the effect of winning games on a team's merchandise sales? It can be difficult to determine, but measuring the correlation between the data points can tell you just how much of the variation in a team's merchandise sales is related to its performance.
Although the correlation of games won and merchandise sold may be obvious, our example may contain unsuspected data correlations. You may also suspect there are additional correlations, but are not sure which are the strongest.
A well-planned, thorough correlation analysis on the data can lead to a greater understanding of your data; however, just like all statistical techniques, correlation is only appropriate for certain types of data. Correlation works for quantifiable data in which numbers are meaningful - usually quantities of some sort (such as products sold). It cannot be used for purely categorical data, such as an individual's gender, brand preference, or education level.
Let's go on and look at causation a bit more closely.
Causation
It is very important to understand the concept of causation and how it compares to correlation.
In statistics, causation describes a relationship in which a change in one variable brings about a change in another variable. When such a relationship can be established, the effect of the first variable on the second can be predicted.
Causation involves correlation, but correlation does not imply causation. Every variable somehow linked to another may appear to imply causation. This is not always so; linking one thing with another does not always prove that the result has been caused by the other. The rule of thumb is: only if you can directly link a variance or change of a variable to that of another, can you say it's causation.
The degree of correlation
To have a way of indicating or quantifying a statistical relationship between variables, we use a number referred to as a correlation coefficient:
- It ranges from -1.0 to +1.0
- The closer it is to +1 or -1, the more closely the two variables are related
- If zero, there is no linear relationship between the variables it represents
- If positive, it means that, as one variable increases, the other increases
- If negative, it means that, as one variable increases, the other decreases
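These properties can be checked with the R cor function. The following is a small sketch on constructed data; the variable names are illustrative only:

```r
x <- -5:5

r_up    <- cor(x,  2 * x + 1)   # y rises in a straight line with x
r_down  <- cor(x, -3 * x + 5)   # y falls in a straight line as x rises
r_curve <- cor(x, x^2)          # y depends on x, but not linearly

c(increasing = r_up, decreasing = r_down, curved = r_curve)
```

The first two coefficients sit at the ends of the range (+1 and -1). The third is zero even though y is completely determined by x, which is a reminder that the coefficient measures linear relationships.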
Reporting on correlation
While correlation coefficients are normally reported simply as-is (a value between -1 and +1), they are often squared first to make them easier to understand.
If a correlation coefficient is r, then r squared, expressed as a percentage, equals the proportion of the variation in one variable that is related to the variation in the other. Given this, a correlation coefficient of .5 would mean that the variation percentage is 25% (0.5 squared is 0.25).
In an earlier section of this chapter, we looked at a simple visualization of a correlation between the number of games won and the sales of a team's merchandise. In practice, when creating a correlation report, you can also show a second figure – statistical significance. Adding significance will show the probability of error within the identified correlation information. Finally, since sample size can impact outcomes significantly, it is a good practice to also show the sample size.
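A minimal sketch of such a report, assuming made-up data (the two vectors are hypothetical, not actual NFL figures) and using the R cor.test function, which computes a Pearson correlation by default:

```r
# Hypothetical data for illustration only.
games_won        <- c(3, 4, 6, 7, 8, 9, 10, 11, 12, 13)
merchandise_sold <- c(120, 150, 140, 210, 230, 260, 250, 310, 330, 360)

ct <- cor.test(games_won, merchandise_sold)

# Collect the figures discussed above into a one-row report.
report <- data.frame(
  r         = unname(ct$estimate),     # correlation coefficient
  r_squared = unname(ct$estimate)^2,   # share of variation related between the two
  p_value   = ct$p.value,              # statistical significance
  n         = length(games_won)        # sample size
)
print(report)
```

Reporting all four figures together lets the reader judge the strength of the relationship, how much variation it accounts for, and how much data stands behind it.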
In summary, identifying correlations within the data being observed is a common and accepted method for accomplishing dimensionality reduction. Another method is principal component analysis, which we will cover in the next section.
Principal component analysis
Principal component analysis (PCA) is another popular statistical technique used for dimensional reduction in data.
Note
PCA is also called the Karhunen-Loève transform method and, depending upon the audience, has been said to be the most commonly used dimension reduction technique.
PCA is a technique that attempts to not only reduce data dimensionality but also retain as much of the variation in the data as possible.
Note
Principal component analysis is an approach to factor analysis (which will be discussed later in this chapter) that considers the total variance in the data file.
The process of PCA uses what is known as an orthogonal transformation process to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. These variables are then called principal components, or the data's principal modes of variation.
Through the PCA effort, the number of principal components (or variables) will be less than or equal to the smaller of the number of original variables and the number of original observations, thereby reducing the number of independent dimensions in the data (that is, dimensional reduction).
These principal components are defined and arranged such that the first principal component accounts for as much of the variability in the data as possible, and each succeeding principal component has the next highest variance possible subject to the constraint that it is orthogonal to the preceding components.
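These properties can be checked numerically. The following sketch uses the R princomp function on the numeric columns of the built-in iris data file; the choice of data file here is only for illustration:

```r
pc <- princomp(iris[, 1:4])

# Variance of each component, in the order the components are returned.
variances <- pc$sdev^2
variances
all(diff(variances) <= 0)   # each variance is no larger than the one before it

# The component directions are orthogonal unit vectors, so their
# cross-product is (numerically) the identity matrix.
round(crossprod(unclass(loadings(pc))), 10)

# The component scores are linearly uncorrelated with one another.
round(cor(pc$scores), 10)
```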
A useful consequence of this orthogonality is that, because the principal components are uncorrelated, the same results are obtained for the effect of any one of them upon a dependent variable, regardless of whether one models the effects of the components individually or together.
PCA is mostly used as a tool when performing an exploratory data analysis, or data profiling, since its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data, but it can also be helpful in predictive modeling.
Note
Data profiling involves logically getting to know the data through query, experimentation, and review. Following the profiling process, you can then use the information you have collected to add context (and/or apply new perspectives) to the data. Adding context to data requires the manipulation of the data to perhaps reformat it, such as by adding calculations, aggregations or additional columns or re-ordering, and so on.
Principal component analysis or PCA is an alternative form of the common factor analysis (which will be discussed later in this chapter) process. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure of the data being observed and solves eigenvectors of a slightly different matrix.
To understand, at a very high level, think of PCA as fitting an n-dimensional ellipsoid to a plotted data file, where each axis of the ellipsoid represents a principal component of that data. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis (and its corresponding principal component) from the data file, we will lose only a commensurately small amount of information about our data file; otherwise, the axis (the principal component) stays, representing some degree of variation around the mean of the data file overall.
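The ellipsoid picture can be sketched with an eigen-decomposition of the covariance matrix, where the eigenvectors play the role of the axes and the eigenvalues give the variance along each axis. This is an illustration on the built-in iris data file, not a description of how any particular R function is implemented:

```r
x <- as.matrix(iris[, 1:4])
centered <- scale(x, center = TRUE, scale = FALSE)   # move the data to its mean

e <- eigen(cov(centered))
share <- e$values / sum(e$values)   # share of total variance along each axis
round(share, 4)

# Omit the shortest axis: project onto the three longest axes and map back.
keep    <- e$vectors[, 1:3]
rebuilt <- centered %*% keep %*% t(keep)

# Fraction of the total variation lost by omitting that axis.
lost <- sum((centered - rebuilt)^2) / sum(centered^2)
lost
```

The fraction lost equals the variance share of the omitted axis, which is what "commensurately small" means in the paragraph above.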
Using R to understand PCA
Rather than working through the logic described in the preceding paragraph, you can use the generic R function princomp to explore PCA with much less effort.
The princomp function is a way of simplifying a complex data file by exposing the sources of variation within the data, using the calculated standard deviations of the data file's principal components. This is best illustrated using the classic iris data file (provided when you install the R programming language).
The data file (partially shown in the following screenshot) contains flower attributes (petal width, petal length, sepal width, and sepal length) for 150 iris flowers, 50 from each of three species:
We should note that the princomp function cannot handle string data (which is okay since our PCA analysis is only interested in the numerical variations of the principal components), so we pass it only the first four columns of data (dropping the fifth column, Species) and save the result to a new object named pca by using the following line of R code:
pca <- princomp(iris[-5])
Next, we can use the R summary command on the pca object we just created. The output generated is shown as follows:
You can see from the previous image that about 92.5% of the variation in the dataset is explained by the first component alone, and 97.8% is explained by the first two components!
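The figures reported by summary can also be recomputed by hand from the component standard deviations, as in this short sketch:

```r
pca <- princomp(iris[-5])
summary(pca)

# Proportion of variance explained by each component, and the running total.
prop <- pca$sdev^2 / sum(pca$sdev^2)
round(prop, 3)
round(cumsum(prop), 3)
```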
To better understand, we can visualize the preceding observations by using the R screeplot function with our pca object, shown as follows:
The screeplot function generates a scree plot displaying how much of the total variation in a data file is explained by each of the components in a principal component analysis, showing how many of the principal components are needed to summarize the data.
Our generated visualization is as follows:
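A minimal sketch of the call follows; the type and main arguments are optional choices made here for readability:

```r
pca <- princomp(iris[-5])

# Bar chart of component variances (the default type).
screeplot(pca, main = "Scree plot of iris principal components")

# The same information drawn as a line, which makes the "elbow" easier to spot.
screeplot(pca, type = "lines", main = "Scree plot of iris principal components")
```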
The results of a PCA are usually discussed in terms of the scores of each component (sometimes referred to as factor scores), which are the transformed variable values corresponding to a particular data point in a data file, and loadings, which are the weights by which each standardized original variable should be multiplied to get the component score.
We can use the R loadings function and the scores element of our pca object (pca$scores) to view our loadings and scores information, shown as follows:
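As a sketch, the loadings and scores can be printed directly, and the relationship between them can be checked by rebuilding the scores from the centered data:

```r
pca <- princomp(iris[-5])

loadings(pca)      # weights applied to each original variable
head(pca$scores)   # component scores for the first six flowers

# Scores are the centered data multiplied by the loadings.
centered       <- scale(iris[-5], center = TRUE, scale = FALSE)
rebuilt_scores <- centered %*% unclass(loadings(pca))
max(abs(rebuilt_scores - pca$scores))   # effectively zero
```

Note that when loadings are printed, very small values are left blank by default to make the pattern easier to read.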
Independent component analysis
Yet another concept concerning dimension reduction is independent component analysis (ICA). This is a process that attempts to uncover or verify statistically independent variables or dimensions in high-dimensional data sources.
Using one selected ICA process, each variable or dimension in the data can be identified, examined for independence, and then selectively removed from, or retained in, the overall data analysis.
Note
If you take the time to research ICA further, you will encounter the common example application called the cocktail party problem, which is the task of isolating one person's speech in a noisy room.
Defining independence
ICA attempts to find independent components in data by maximizing the statistical independence of the components. But just how is a variable's independence defined?
Components are determined to be independent if the realization of one does not affect the probability distribution of the other.
There are many acceptable ways to determine independence, and your choice will determine the form of the ICA algorithm used.
The two most widely used definitions of independence (for ICA) are:
- Minimization of mutual information (MMI): Mutual information measures the information that two or more components share, measuring to what extent knowing one of these variables reduces uncertainty about the other. The less mutual information a component includes, the more independent the component is.
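- Non-Gaussianity maximization (NGM): By the central limit theorem, a mixture of independent signals tends to be closer to a Gaussian (normal) distribution than the individual signals are. Non-Gaussianity maximization therefore looks for components whose distributions are as far from Gaussian as possible (measured, for example, by kurtosis or negentropy) and treats those as the independent components.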
ICA pre-processing
Before applying any ICA logic to data, typical ICA methodologies use procedures such as centering and whitening as preprocessing steps in order to simplify the problem and highlight features of the data not readily explained by its mean or covariance.
In other words, before attempting to determine the level of a component's independence, the data file may be reviewed and manipulated to make it more understandable using various preprocessing methods.
Centering is the most basic preprocessing method commonly used, which, true to its name, involves centering the data (x) by subtracting its mean, thus making each variable a zero-mean variable. Whitening, another preprocessing method, transforms the data (x) linearly so that its components are uncorrelated and have unit variance.
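Both steps can be sketched in base R. The whitening matrix below is built from an eigen-decomposition of the covariance matrix, which is one common choice among several; the iris data file is used only as a convenient numeric example:

```r
x <- as.matrix(iris[, 1:4])

# Centering: subtract each column's mean.
centered <- sweep(x, 2, colMeans(x))
round(colMeans(centered), 10)   # every column now has mean zero

# Whitening: rescale along the eigenvectors of the covariance matrix.
e <- eigen(cov(centered))
whitening <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
whitened  <- centered %*% whitening

# The whitened columns are uncorrelated and each has unit variance,
# so their covariance matrix is (numerically) the identity.
round(cov(whitened), 10)
```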
Factor analysis
Factor analysis is yet another important statistical method, used to describe the variability among observed, correlated variables in terms of a (potentially) lower number of unobserved (also referred to as latent) variables, or factors.
Note
Observed variables are those variables for which you should have clearly defined measurements in a data file, whereas unobserved variables are those variables for which you do not have clearly defined measurements and that perhaps are inferred from certain observed variables within your data file.
In other words, we consider: Is it possible that variations in six observed variables reflect the same variations found in only two unobserved variables?
Factor analysis should/can be used when a data file includes a large number of observed variables that seem to reflect a smaller number of unobserved variables, giving us an opportunity for dimensional reduction, reducing the number of elements to be studied, and observing how they are interlinked.
Overall, factor analysis involves using techniques to help yield a smaller number of linear combinations of variables that, even though there is a reduced number of variables, account for and explain the majority of the variance in the data's components.
In short, performing a factor analysis on a data file searches for joint variations in the observed variables that can be attributed to unobserved (latent) variables.
Explore and confirm
Typically, one would begin the exercise of factor analysis with an exploration of the data file, exploring probable underlying factor structures of a set(s) of the data's observed variables without imposing a predetermined outcome. This process is referred to as exploratory factor analysis (EFA).
During the exploratory factor analysis phase, we attempt to determine the number of unobserved or hidden variables and come up with a means of explaining variation within the data using a smaller number of hidden variables; in other words, we are condensing the information required for observation.
Once determinations are made (one or more hypotheses are formed), one would then want to confirm (or test or validate) the factor structure that was revealed during the EFA process. This step is known widely as confirmatory factor analysis (CFA).
Using R for factor analysis
As usual, the R programming language provides various ways of performing a proper factor analysis.
For example, we have the R function factanal. This function performs a maximum-likelihood factor analysis on a numeric matrix. In its simplest form, this function needs x (your numeric matrix object) and the number of factors to be considered (or fitted):
factanal(x, factors = 3)
We can run a simple example here for clarification based upon the R documentation and a number of generic functions.
First, a numeric matrix is constructed by combining lists of numeric values into six variables saved as R vectors (v1 through v6).
The R code is shown as follows:
v1 <- c(1,1,1,1,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,1,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,1,2,1,6,5,4)
The next step is to create a numeric matrix from our six variables, calling it m1.
To accomplish this, we can use the R cbind function:
m1 <- cbind(v1, v2, v3, v4, v5, v6)
The following screenshot shows our code executed, along with a summary of our object m1:
The R function summary provides us with some interesting details about our six variables, such as the minimum and maximum values, median, and the mean.
Now that we have a numeric matrix, we can review the correlations between our six variables by running the R cor function.
The following screenshot shows the output that is generated from the cor function:
Interesting information, but let's move on.
Finally, we are now ready to use the R function factanal.
Again, using the function's simplest form (simply providing the name of the data to analyze, m1, and the number of factors to consider; let's use 3), the following output is generated for us:
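As a minimal sketch, the call can be reproduced as follows. The vectors are repeated so that the snippet runs on its own, and the object name fit is an illustrative choice:

```r
v1 <- c(1,1,1,1,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,1,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,1,2,1,6,5,4)
m1 <- cbind(v1, v2, v3, v4, v5, v6)

fit <- factanal(m1, factors = 3)
print(fit)

# Individual pieces of the result can also be inspected directly.
fit$uniquenesses   # one value per observed variable
fit$loadings       # six variables by three factors
fit$dof            # degrees of freedom of the fitted model
```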
The output
The R function factanal first calculates uniqueness. Uniqueness is the variance that is unique to each variable and not shared with other variables.
Factor loadings are also calculated and displayed in a factor matrix. The factor matrix is the matrix that contains the factor loadings of all the variables on all the factors extracted. The term factor loadings denotes the simple correlations between the factors and the variables, such as the correlation between the observed score and the unobserved score. Generally, the higher the better.
Notice the final message generated in the preceding screenshot:
The degrees of freedom for the model is 0 and the fit was 0.4755
In a statistical model, the number of degrees of freedom is the number of independent pieces of information left over once the model's parameters have been estimated. A value of zero means that the model fits exactly as many parameters as the data can support, leaving nothing with which to test the fit (therefore, a zero is not so good). So, if we experiment a bit by increasing the number of factors to four, we see that factanal is smart enough to tell us that, with only six variables, four is too many factors.
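For a maximum-likelihood factor model with p observed variables and k factors, the degrees of freedom are ((p - k)^2 - (p + k)) / 2. The short sketch below applies that formula; the helper name factor_dof is hypothetical and introduced only for this illustration:

```r
# Degrees of freedom for p observed variables and k factors.
factor_dof <- function(p, k) ((p - k)^2 - (p + k)) / 2

dof <- factor_dof(p = 6, k = 2:4)
names(dof) <- paste(2:4, "factors")
dof
```

With six variables, two factors leave 4 degrees of freedom, three factors leave 0, and four factors would leave a negative number (-3), which is why four factors cannot be fitted.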
The following screenshot shows (globally) the output generated using four factors:
Given the results shown, let us now try decreasing the number of factors to two:
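Notice that this time the output includes a test of the hypothesis that two factors are sufficient, which is possible because the model now has positive degrees of freedom. Read the reported chi-square statistic and p-value before drawing a conclusion: a small p-value is evidence against the hypothesis that two factors are enough.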
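The two experiments can be sketched as follows. The vectors are repeated so that the snippet runs on its own, and tryCatch is used here only so that the failed four-factor call does not stop the script:

```r
v1 <- c(1,1,1,1,1,1,1,1,1,1,3,3,3,3,3,4,5,6)
v2 <- c(1,2,1,1,1,1,2,1,2,1,3,4,3,3,3,4,6,5)
v3 <- c(3,3,3,3,3,1,1,1,1,1,1,1,1,1,1,5,4,6)
v4 <- c(3,3,4,3,3,1,1,2,1,1,1,1,2,1,1,5,6,4)
v5 <- c(1,1,1,1,1,3,3,3,3,3,1,1,1,1,1,6,4,5)
v6 <- c(1,1,1,2,1,3,3,3,4,3,1,1,1,2,1,6,5,4)
m1 <- cbind(v1, v2, v3, v4, v5, v6)

# Four factors: factanal stops with an error, captured here as text.
four <- tryCatch(factanal(m1, factors = 4),
                 error = function(e) conditionMessage(e))
four

# Two factors: the model can be fitted and tested.
two <- factanal(m1, factors = 2)
two$dof         # degrees of freedom
two$STATISTIC   # chi-square statistic for the sufficiency test
two$PVAL        # p-value for the sufficiency test
```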
Based upon the preceding, the results of our simple factor analysis seem to have improved, but does this mean that this number of factors correctly describes the data?
Obviously, more data, more variables and more experimentation are required!
NNMF
The term factorization refers to the process or act of factoring, which is the breakdown of an object into a result (of other objects, or factors) that, when multiplied together, equal the original. Matrix factorization then, is factorizing a matrix, or finding two (or more) matrices that will equal the original matrix when you multiply them.
Non-negative matrix factorization (NMF, NNMF) is using an algorithm to factorize a matrix into (usually) two matrices with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to analyze.
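To make the idea concrete, here is a minimal sketch in base R that uses multiplicative updates, one well-known way of computing an approximate non-negative factorization. The matrix V, the rank k, and the number of iterations are all assumptions chosen for illustration, and the product of the two factors only approximates the original matrix:

```r
set.seed(1)

# Hypothetical non-negative matrix to factorize (6 rows, 4 columns).
V <- matrix(runif(24), nrow = 6, ncol = 4)

k <- 2                                # number of factors (an assumption)
W <- matrix(runif(6 * k), nrow = 6)   # non-negative starting values
H <- matrix(runif(k * 4), nrow = k)
eps <- 1e-9                           # guards against division by zero

err_start <- sum((V - W %*% H)^2)

# Multiplicative updates: every step multiplies by a non-negative ratio,
# so W and H can never become negative.
for (i in 1:200) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
}

err_end <- sum((V - W %*% H)^2)
c(start = err_start, end = err_end)   # squared reconstruction error
```

Here V is approximated by the product of W (6 by 2) and H (2 by 4), and all three matrices contain only non-negative values.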