Introduction to Feature Engineering

In recent years, engineers and executives have been attempting to implement machine learning (ML) and artificial intelligence (AI) to solve problems that, for the most part, have been solved using fairly manual methodologies. A great example would have to be advancements in natural language processing (NLP) and more specifically in natural language generation and understanding. Even more specifically, we point to AI systems that are able to read in raw text from a user (perhaps a disgruntled user of the latest smartphone) and can articulately and accurately respond with the prose of a human and the speed of a machine. In this chapter, we will be introducing topics of feature engineering, such as:

Motivating examples of why feature engineering matters
Basic understanding of machine learning, including performance, evaluation
A detailed list of the chapters included in this book

Motivating example – AI-powered communications

Meet Arty, our AI chat system that is able to handle and respond to incoming customer support requests, just as any of our human employees would. Arty is endowed with the knowledge of our company and is ready to go at a moment’s notice.

Here is how a sample dialogue between a human and an AI customer support system would transpire:

Human	AI
Hello, my phone is broken.	Sorry to hear that, how is it broken?
It’s frozen and I can’t reset it.	What kind of phone is it?
The new iDroid 28	Ahh, I see. Hold the power and volume down button for 20 seconds and it should reset.
It worked, thanks!	No problem, have a great day.

The reason that these types of systems are exciting and are disrupting major markets is the simplicity of such a complicated system. Let us break it down. On the surface, you might think, what an easy problem! The person has a simple problem with a simple solution. A request comes in and a response comes out. Hello, my phone froze, what should I do? Easy, just reset it. And sure, on the surface, that is what is happening here:

from Arty import AI
AI.respond_to("my phone froze, what should I do?")
>> "reset it."

The tough part comes in when you look at it from the AI’s perspective. It hasn’t had the entire human experience that we have had. It hasn’t had the privilege to read The Iliad or even Clifford the Big Red Dog and learn to internalize their messages. The point is, the AI hasn’t had a lot of experience in reading things. This AI has probably been given a few hundred thousand (maybe even millions of) previous chat dialogues and was told to figure it out.

The following is a sample of data given to our AI system based on previous chat logs:

Request	Response
Helllo	Hi, what seems to be the problem?
My phone doesn’t work!!!!	Oh noo!!!! What’s going on with it?
>Hold on, I have to go walk my dog. BRB.	OK. I will wait.
Hey.	Hello. My name is Mark, how can I help?

The data is organized into two columns where the Request column represents what the end user types into a chat support dialogue. The next column, Response, represents the customer support agent’s response to the incoming message.

While reading over the thousands of typos, angry messages, and disconnected chats, the AI starts to think that it has this customer support thing down. Once this happens, the humans set the AI loose on new chats coming in. The humans, not realizing their mistake, start to notice that the AI hasn’t fully gotten the hang of this yet. The AI can’t seem to recognize even simple messages and keeps returning nonsensical responses. It’s easy to think that the AI just needs more time or more data, but these solutions are just band-aids to the bigger problem, and often do not even solve the issue in the first place.

The underlying problem is likely that the data given to the AI in the form of raw text wasn’t good enough and the AI wasn’t able to pick up on the nuances of the English language. For example, some of the problems would likely include:

Typos artificially expand the AI’s vocabulary without cause. To the AI, Helllo and hello are two different words that are not related to each other.
Synonyms mean nothing to the AI. To the AI, words such as hello and hey have no similarity, which makes the problem artificially harder.
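
To make these two problems concrete, here is a minimal sketch in plain Python. The sample messages and the collapse_repeats helper are hypothetical illustrations and are not part of Arty; the helper is a crude heuristic, shown only to illustrate how typos inflate a vocabulary.

```python
import re

# Hypothetical sample of incoming requests (not real chat logs)
requests = ["Helllo", "hello", "Hello", "Hey"]

# A naive vocabulary treats every distinct string as its own word
naive_vocabulary = set(requests)


def collapse_repeats(word):
    """Lowercase a word and squeeze runs of 3+ identical characters down to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word.lower())


cleaned_vocabulary = {collapse_repeats(word) for word in requests}

print(len(naive_vocabulary))       # four distinct "words" before cleaning
print(sorted(cleaned_vocabulary))  # the typo and casing variants collapse together
```

Note that this kind of cleanup only addresses the typo problem. The words hello and hey still end up as two unrelated entries, so the synonym problem needs a different kind of feature.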

Why feature engineering matters

Data scientists and machine learning engineers frequently gather data in order to solve a problem. Because the problem they are attempting to solve is often highly relevant and occurs naturally in this messy world, the data that is meant to represent the problem can also end up being quite messy, unfiltered, and often incomplete.

This is why, in the past several years, positions with titles such as Data Engineer have been popping up. These engineers have the unique job of engineering pipelines and architectures designed to handle and transform raw data into something usable by the rest of the company, particularly the data scientists and machine learning engineers. This job is just as important as the machine learning experts’ job of creating machine learning pipelines, yet it is often overlooked and undervalued.

A survey of data scientists in the field revealed that roughly 80% of their time was spent capturing, cleaning, and organizing data. The remaining 20% or so of their time was spent creating the machine learning pipelines that end up dominating the conversation. Moreover, not only are these data scientists spending most of their time preparing data; more than 75% of them also reported that preparing data was the least enjoyable part of their process.

Here are the findings of the survey mentioned earlier. The first chart shows what data scientists spend the most time doing, broken down by task:

Building training sets: 3%
Cleaning and organizing data: 60%
Collecting data for sets: 19%
Mining data for patterns: 9%
Refining algorithms: 5%

A similar chart shows what data scientists reported as the least enjoyable part of data science:

Building training sets: 10%
Cleaning and organizing data: 57%
Collecting data sets: 21%
Mining for data patterns: 3%
Refining algorithms: 4%
Others: 5%

The first chart represents the percentage of time that data scientists spend on different parts of the process. Roughly 80% of a data scientist's time is spent preparing data for further use. The second chart represents the percentage of those surveyed reporting their least enjoyable part of the process of data science. Over 75% of them report that preparing data is their least enjoyable part.

Source of the data: https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/.

A stellar data scientist knows not only that preparing data is so important that it takes up most of their time, but also that it is an arduous process and can be unenjoyable. Far too often, we take for granted the clean data given to us by machine learning competitions and academic sources. More than 90% of data, including the data that is the most interesting and the most useful, exists in this raw format, like in the AI chat system described earlier.

Preparing data can be a vague phrase. Preparing takes into account capturing data, storing data, cleaning data, and so on. As seen in the charts shown earlier, a smaller share, though still a majority, of a data scientist's time is spent on cleaning and organizing data. It is in this process that our data engineers are the most useful to us. Cleaning refers to the process of transforming data into a format that can be easily interpreted by our cloud systems and databases. Organizing generally refers to a more radical transformation. Organizing tends to involve changing the entire format of the dataset into a much neater format, such as transforming raw chat logs into a tabular row/column structure.

Here is an illustration of Cleaning and Organizing:

The top transformation represents cleaning up a sample of server logs that include both the date and a text explanation of what is occurring on the servers. Notice that while cleaning, the &amp; sequence, which is an encoded form of a single character, was transformed into a more readable ampersand (&). The cleaning phase left the document in pretty much the same format as before. The bottom organizing transformation was a much more radical one. It turned the raw document into a row/column structure, in which each row represents a single action taken by the server and the columns represent attributes of the server action. In this case, the two attributes are Date and Text.
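
The two transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log lines, the " - " separator, and the variable names are hypothetical, and we assume the raw logs contain HTML-escaped ampersands such as &amp;.

```python
import html

import pandas as pd

# Hypothetical raw server log lines: a date, a separator, then free text
raw_log = [
    "2018-06-02 - Server started &amp; listening",
    "2018-06-02 - User logged in",
    "2018-06-03 - Disk check &amp; backup complete",
]

# Cleaning: decode the escaped characters but keep the document's shape
cleaned_log = [html.unescape(line) for line in raw_log]

# Organizing: split each line once on the separator into Date and Text columns
rows = [line.split(" - ", 1) for line in cleaned_log]
organized = pd.DataFrame(rows, columns=["Date", "Text"])

print(organized)
```

After cleaning, we still have a list of free-form lines. Only after organizing do we have rows and columns that a machine learning pipeline can consume.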

Both cleaning and organizing fall under a larger category of data science, which just so happens to be the topic of this book, feature engineering.

What is feature engineering?

Finally, the title of the book.

Yes, folks, feature engineering will be the topic of this book. We will be focusing on the process of cleaning and organizing data for the purposes of machine learning pipelines. We will also go beyond these concepts and look at more complex transformations of data in the forms of mathematical formulas and neural understanding, but we are getting ahead of ourselves. Let’s start at a high level.

Feature engineering is the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance.

To break this definition down a bit further, let's look at precisely what feature engineering entails:

Process of transforming data: Note that we are not specifying raw data, unfiltered data, and so on. Feature engineering can be applied to data at any stage. Oftentimes, we will be applying feature engineering techniques to data that is already processed in the eyes of the data distributor. It is also important to mention that the data that we will be working with will usually be in a tabular format. The data will be organized into rows (observations) and columns (attributes). There will be times when we will start with data at its most raw form, such as in the examples of the server logs mentioned previously, but for the most part, we will deal with data already somewhat cleaned and organized.
Features: The word features will obviously be used a lot in this book. At its most basic level, a feature is an attribute of data that is meaningful to the machine learning process. Many times we will be diagnosing tabular data and identifying which columns are features and which are merely attributes.
Better represent the underlying problem: The data that we will be working with will always serve to represent a specific problem in a specific domain. It is important to ensure that while we are performing these techniques, we do not lose sight of the bigger picture. We want to transform data so that it better represents the bigger problem at hand.
Resulting in improved machine learning performance: Feature engineering exists as a single part of the process of data science. As we saw, it is an important and oftentimes undervalued part. The eventual goal of feature engineering is to obtain data that our learning algorithms will be able to extract patterns from and use in order to obtain better results. We will talk in depth about machine learning metrics and results later on in this book, but for now, know that we perform feature engineering not only to obtain cleaner data, but to eventually use that data in our machine learning pipelines.

We know what you’re thinking: why should I spend my time reading about a process that people say they do not enjoy doing? We believe that many people do not enjoy the process of feature engineering because they often do not have the benefit of understanding the results of the work that they do.

Most companies employ both data engineers and machine learning engineers. The data engineers are primarily concerned with the preparation and transformation of the data, while the machine learning engineers usually have a working knowledge of learning algorithms and how to mine patterns from already cleaned data.

Their jobs are often separate but intertwined and iterative. The data engineers will present a dataset to the machine learning engineers, who will claim that they cannot get good results from it and ask the data engineers to try to transform the data further, and so on and so forth. This process can be not only monotonous and repetitive, but it can also hurt the bigger picture.

Without having knowledge of both feature and machine learning engineering, the entire process might not be as effective as it could be. That’s where this book comes in. We will be talking about feature engineering and how it relates directly to machine learning. It will be a results-driven approach where we will deem techniques as helpful if, and only if, they can lead to a boost in performance. It is worth now diving a bit into the basics of data, the structure of data, and machine learning, to ensure standardization of terminology.

Understanding the basics of data and machine learning

When we talk about data, we are generally dealing with tabular data, that is, data that is organized into rows and columns. Think of this as being able to be opened in a spreadsheet technology such as Microsoft Excel. Each row of data, otherwise known as an observation, represents a single instance/example of a problem. If our data belongs to the domain of day-trading in the stock market, an observation might represent an hour’s worth of changes in the overall market and price.

For example, when dealing with the domain of network security, an observation could represent a possible attack or a packet of data sent over a wireless system.

The following shows sample tabular data in the domain of cyber security and more specifically, network intrusion:

DateTime	Protocol	Urgent	Malicious
June 2nd, 2018	TCP	FALSE	TRUE
June 2nd, 2018	HTTP	TRUE	TRUE
June 2nd, 2018	HTTP	TRUE	FALSE
June 3rd, 2018	HTTP	FALSE	TRUE

We see that each row or observation consists of a network connection and we have four attributes of the observation: DateTime, Protocol, Urgent, and Malicious. While we will not dive into these specific attributes, we will simply notice the structure of the data given to us in a tabular format.

Because we will, for the most part, consider our data to be tabular, we can also look at specific instances where the matrix of data has only one column/attribute. For example, suppose we are building a piece of software that is able to take in a single image of a room and output whether or not there is a human in that room. The data for the input might be represented as a matrix of a single column, where the single column is simply a URL to a photo of a room and nothing else.

For example, consider the following table, which has only a single column titled Photo URL. The values of the table are URLs (these are fake, do not lead anywhere, and are purely for example) of photos that are relevant to the data scientist:

Photo URL
http://photo-storage.io/room/1
http://photo-storage.io/room/2
http://photo-storage.io/room/3
http://photo-storage.io/room/4

The data that is inputted into the system might only be a single column, such as in this case. In our effort to create a system that can analyze images, the input might simply be a URL to the image in question. It would be up to us as data scientists to engineer features from the URL.

As data scientists, we must be ready to ingest and handle data that might be large, small, wide, narrow (in terms of attributes), or sparse in completion (there might be missing values), and be ready to utilize this data for the purposes of machine learning. Now’s a good time to talk more about that. Machine learning algorithms belong to a class of algorithms that are defined by their ability to extract and exploit patterns in data to accomplish a task based on historical training data. Vague, right? Machine learning can handle many types of tasks, and therefore we will leave the definition of machine learning as is and dive a bit deeper.

We generally separate machine learning into two main types, supervised and unsupervised learning. Each type of machine learning algorithm can benefit from feature engineering, and therefore it is important that we understand each type.

Supervised learning

Oftentimes, we hear about feature engineering in the specific context of supervised learning, otherwise known as predictive analytics. Supervised learning algorithms specifically deal with the task of predicting a value, usually one of the attributes of the data, using the other attributes of the data. Take, for example, the dataset representing the network intrusion:

DateTime	Protocol	Urgent	Malicious
June 2nd, 2018	TCP	FALSE	TRUE
June 2nd, 2018	HTTP	TRUE	TRUE
June 2nd, 2018	HTTP	TRUE	FALSE
June 3rd, 2018	HTTP	FALSE	TRUE

This is the same dataset as before, but let's dissect it further in the context of predictive analytics.

Notice that we have four attributes of this dataset: DateTime, Protocol, Urgent, and Malicious. Suppose now that the Malicious attribute contains values that represent whether or not the observation was a malicious intrusion attempt. So in our very small dataset of four network connections, the first, second, and fourth connections were malicious attempts to intrude on a network.

Suppose further that given this dataset, our task is to be able to take in three of the attributes (datetime, protocol, and urgent) and be able to accurately predict the value of malicious. In layman’s terms, we want a system that can map the values of datetime, protocol, and urgent to the values in malicious. This is exactly how a supervised learning problem is set up:

import pandas as pd

# Features: the three attributes we use to make the prediction
network_features = pd.DataFrame({'datetime': ['6/2/2018', '6/2/2018', '6/2/2018', '6/3/2018'], 'protocol': ['tcp', 'http', 'http', 'http'], 'urgent': [False, True, True, False]})
# Response: the attribute we are attempting to predict (malicious)
network_response = pd.Series([True, True, False, True])
network_features
>>
   datetime protocol  urgent
0  6/2/2018      tcp   False
1  6/2/2018     http    True
2  6/2/2018     http    True
3  6/3/2018     http   False
network_response
>>
0     True
1     True
2    False
3     True
dtype: bool

When we are working with supervised learning, we generally call the attribute of the dataset that we are attempting to predict the response (usually there is only one such attribute, but that is not necessary). The remaining attributes of the dataset are then called the features.
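
In practice, the features and the response often arrive together in one table and have to be separated. The following is a minimal sketch of that split using pandas; the combined network table and the names response_column, X, and y are our own illustrative choices.

```python
import pandas as pd

# The full network intrusion table, with the response still attached
network = pd.DataFrame({
    "datetime": ["6/2/2018", "6/2/2018", "6/2/2018", "6/3/2018"],
    "protocol": ["tcp", "http", "http", "http"],
    "urgent": [False, True, True, False],
    "malicious": [True, True, False, True],
})

response_column = "malicious"

# Everything except the response column is a feature
X = network.drop(columns=[response_column])
# The response is the single column we are attempting to predict
y = network[response_column]

print(list(X.columns))
print(y.tolist())
```

The names X and y follow a common convention in which X holds the features and y holds the response.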

Supervised learning can also be considered the class of algorithms attempting to exploit the structure in data. By this, we mean that the machine learning algorithms try to extract patterns in usually very nice and neat data. As discussed earlier, we should not always expect data to come in tidy; this is where feature engineering comes in.

But if we are not predicting something, what good is machine learning, you may ask? I’m glad you did. Before machine learning can exploit the structure of data, sometimes we have to alter or even create structure. That’s where unsupervised learning becomes a valuable tool.

Unsupervised learning

Supervised learning is all about making predictions. We utilize features of the data and use them to make informative predictions about the response of the data. If we aren’t making predictions by exploiting structure, we are attempting to extract structure from our data. We generally do so by applying mathematical transformations to numerical matrix representations of data or iterative procedures to obtain new sets of features.

This concept can be a bit more difficult to grasp than supervised learning, and so I will present a motivating example to help elucidate how this all works.

Unsupervised learning example – marketing segments

Suppose we are given a large (one million rows) dataset where each row/observation is a single person with basic demographic information (age, gender, and so on) as well as the number of items purchased, which represents how many items this person has bought from a particular store:

Age	Gender	Number of items purchased
25	F	1
28	F	23
61	F	3
54	M	17
51	M	8
47	F	3
27	M	22
31	F	14

This is a sample of our marketing dataset where each row represents a single customer with three basic attributes about each person. Our goal will be to segment this dataset into types or clusters of people so that the company performing the analysis can understand the customer profiles much better.

Now, of course, we’ve only shown 8 out of one million rows, and the full dataset can be daunting. We can certainly perform basic descriptive statistics on this dataset and get averages, standard deviations, and so on of our numerical columns; however, what if we wished to segment these one million people into different types so that the marketing department can have a much better sense of the types of people who shop and create more appropriate advertisements for each segment?

Each type of customer would exhibit particular qualities that make that segment unique. For example, they may find that 20% of their customers fall into a category they like to call young and wealthy: customers who are generally younger and purchase several items.

This type of analysis and the creation of these types can fall under a specific type of unsupervised learning called clustering. We will discuss this machine learning algorithm in further detail later on in this book, but for now, clustering will create a new feature that separates out the people into distinct types or clusters:

Age	Gender	Number of items purchased	Cluster
25	F	1	6
28	F	23	1
61	F	3	3
54	M	17	2
51	M	8	3
47	F	3	8
27	M	22	5
31	F	14	1

This shows our customer dataset after a clustering algorithm has been applied. Note the new column at the end called Cluster that represents the types of people that the algorithm has identified. The idea is that the people who belong to the same cluster behave similarly with regard to the data (have similar ages, genders, purchase behaviors). Perhaps cluster six might be renamed as young buyers.
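
As a rough sketch of how such a column could be produced, the following applies k-means from scikit-learn to the eight sample rows. The choice of k-means, the 0/1 encoding of gender, the scaling step, and n_clusters=3 are all assumptions made for illustration; they are not necessarily how the table above was generated, and the cluster numbers will not match it.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# The eight sample customers from the marketing dataset
customers = pd.DataFrame({
    "age": [25, 28, 61, 54, 51, 47, 27, 31],
    "gender": ["F", "F", "F", "M", "M", "F", "M", "F"],
    "items_purchased": [1, 23, 3, 17, 8, 3, 22, 14],
})

# K-means needs numbers, so encode gender as 0/1 (an illustrative choice)
numeric = customers.assign(gender=(customers["gender"] == "M").astype(int))

# Scale the columns so that no single one dominates the distance calculation
scaled = StandardScaler().fit_transform(numeric)

# n_clusters=3 is an arbitrary assumption for this tiny sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
customers["cluster"] = kmeans.fit_predict(scaled)

print(customers)
```

The cluster labels themselves are arbitrary integers. Giving them meaningful names, such as young buyers, is the human-driven part of the analysis.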

This example of clustering shows us why sometimes we aren’t concerned with predicting anything, but instead wish to understand our data on a deeper level by adding new and interesting features, or even removing irrelevant features.

Note that we are referring to every column as a feature because there is no response in unsupervised learning since there is no prediction occurring.

It’s all starting to make sense now, isn’t it? These features that we talk about repeatedly are what this book is primarily concerned with. Feature engineering involves the understanding and transforming of features in relation to both unsupervised and supervised learning.

Evaluation of machine learning algorithms and feature engineering procedures

It is important to note that in the literature, oftentimes there is a stark contrast between the terms features and attributes. The term attribute is generally given to columns in tabular data, while the term feature is generally given only to attributes that contribute to the success of machine learning algorithms. That is to say, some attributes can be unhelpful or even hurtful to our machine learning systems. For example, when predicting how long a used car will last before requiring servicing, the color of the car will probably not be very indicative of this value.

In this book, we will generally refer to all columns as features until they are proven to be unhelpful or hurtful. When this happens, we will usually cast those attributes aside in the code. It is extremely important, then, to consider the basis for this decision. How does one evaluate a machine learning system and then use this evaluation to perform feature engineering?

Example of feature engineering procedures – can anyone really predict the weather?

Consider a machine learning pipeline that was built to predict the weather. For the sake of simplicity in our introductory chapter, assume that our algorithm takes in atmospheric data directly from sensors and is set up to predict one of two values, sun or rain. This pipeline is then, clearly, a classification pipeline that can only spit out one of two answers. We will run this algorithm at the beginning of every day. If the algorithm outputs sun and the day is mostly sunny, the algorithm was correct. Likewise, if the algorithm predicts rain and the day is mostly rainy, the algorithm was correct. In any other instance, the algorithm would be considered incorrect. If we run the algorithm every day for a month, we would obtain about 30 pairs of predicted weather and actual, observed weather, from which we can calculate the accuracy of the algorithm. Perhaps the algorithm predicted correctly for 20 out of the 30 days, leading us to label the algorithm with a two out of three, or about 67%, accuracy. Using this standardized measure of accuracy, we could tweak our algorithm and see if the accuracy goes up or down.
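
The accuracy calculation can be written out in a few lines of Python. The predicted and actual lists below are made up so that exactly 20 of the 30 days match; they are not real weather data.

```python
# Hypothetical month of forecasts: 18 days predicted sun, 12 days predicted rain
predicted = ["sun"] * 18 + ["rain"] * 12

# Hypothetical observations: 15 of the sun forecasts and 5 of the rain forecasts were right
actual = ["sun"] * 15 + ["rain"] * 3 + ["rain"] * 5 + ["sun"] * 7

# Accuracy is the share of days on which the prediction matched the observation
correct_days = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct_days / len(actual)

print(correct_days)       # 20
print(f"{accuracy:.0%}")  # 67%
```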

Of course, this is an oversimplification, but the idea is that any machine learning pipeline is essentially useless if we cannot evaluate its performance using a set of standard metrics. Feature engineering, as applied to the bettering of machine learning, is therefore impossible without such an evaluation procedure. Throughout this book, we will revisit this idea of evaluation; however, let’s talk briefly about how, in general, we will approach this idea.

When presented with a topic in feature engineering, it will usually involve transforming our dataset (as per the definition of feature engineering). In order to definitively say whether or not a particular feature engineering procedure has helped our machine learning algorithm, we will follow the steps detailed in the following section.

Steps to evaluate a feature engineering procedure

Here are the steps to evaluate a feature engineering procedure:

Obtain a baseline performance of the machine learning model before applying any feature engineering procedures
Apply feature engineering and combinations of feature engineering procedures
For each application of feature engineering, obtain a performance measure and compare it to our baseline performance
If the delta (change in) performance exceeds a threshold (usually defined by the human), we deem that procedure helpful and apply it to our machine learning pipeline
This change in performance will usually be measured as a percentage (if the baseline went from 40% accuracy to 76% accuracy, that is a 90% improvement)
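
The comparison in the last two steps can be sketched as two small helper functions. The function names and the 5% default threshold are hypothetical choices for illustration; in practice the threshold is whatever the human decides is meaningful.

```python
def relative_improvement(baseline, new_score):
    """Return the change in performance as a fraction of the baseline."""
    return (new_score - baseline) / baseline


def is_helpful(baseline, new_score, threshold=0.05):
    """Deem a procedure helpful if its relative improvement exceeds the threshold."""
    return relative_improvement(baseline, new_score) > threshold


baseline_accuracy = 0.40    # performance before any feature engineering
engineered_accuracy = 0.76  # performance after applying a procedure

improvement = relative_improvement(baseline_accuracy, engineered_accuracy)
print(f"{improvement:.0%}")                                # 90%
print(is_helpful(baseline_accuracy, engineered_accuracy))  # True
```

Note that this sketch assumes a metric where higher is better, such as accuracy. For an error metric such as MSE, an improvement is a decrease, so the sign of the comparison would need to be flipped.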

In terms of performance, this idea varies between machine learning algorithms. Most good primers on machine learning will tell you that there are dozens of accepted metrics when practicing data science.

In our case, because the focus of this book is not necessarily on machine learning and rather on the understanding and transformation of features, we will use baseline machine learning algorithms and associated baseline metrics in order to evaluate the feature engineering procedures.

Evaluating supervised learning algorithms

When performing predictive modeling, otherwise known as supervised learning, performance is directly tied to the model’s ability to exploit structure in the data and use that structure to make appropriate predictions. In general, we can further break down supervised learning into two more specific types, classification (predicting qualitative responses) and regression (predicting quantitative responses).

When we are evaluating classification problems, we will directly calculate the accuracy of a logistic regression model using a five-fold cross-validation:

# Example code for evaluating a classification problem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X = some_data_in_tabular_format  # placeholder for the feature matrix
y = response_variable  # placeholder for the qualitative response
lr = LogisticRegression()
# One accuracy value is returned for each of the five folds
scores = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
scores
>> [.765, .67, .8, .62, .99]

Similarly, when evaluating a regression problem, we will use the mean squared error (MSE) of a linear regression using a five-fold cross-validation:

# Example code for evaluating a regression problem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
X = some_data_in_tabular_format  # placeholder for the feature matrix
y = response_variable  # placeholder for the quantitative response
lr = LinearRegression()
# scikit-learn reports MSE as a negated score (higher is better), so we flip the sign
scores = -cross_val_score(lr, X, y, cv=5, scoring='neg_mean_squared_error')
scores
>> [31.543, 29.5433, 32.543, 32.43, 27.5432]

We will use these two linear models instead of newer, more advanced models for their speed and their low variance. This way, we can be surer that any increase in performance is directly related to the feature engineering procedure and not to the model’s ability to pick up on obscure and hidden patterns.
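
The two listings above use placeholders for the data. For readers who want something they can run immediately, here is a hedged, self-contained version that substitutes synthetic data generated by scikit-learn. The datasets, their sizes, and the max_iter setting are assumptions made for illustration, so the scores it prints will not match the sample output shown earlier.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for real tabular data (an assumption for this sketch)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Classification baseline: five-fold cross-validated accuracy
accuracy_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_clf, y_clf, cv=5, scoring="accuracy"
)

# Regression baseline: five-fold cross-validated MSE (sign flipped to be positive)
mse_scores = -cross_val_score(
    LinearRegression(), X_reg, y_reg, cv=5, scoring="neg_mean_squared_error"
)

print(accuracy_scores.mean())
print(mse_scores.mean())
```

Averaging the five fold scores gives a single baseline number that a feature engineering procedure can later be compared against.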

Evaluating unsupervised learning algorithms

This is a bit trickier. Because unsupervised learning is not concerned with predictions, we cannot directly evaluate performance based on how well the model can predict a value. That being said, if we are performing a cluster analysis, such as in the previous marketing segmentation example, then we will usually utilize the silhouette coefficient (a measure of separation and cohesion of clusters between -1 and 1) and some human-driven analysis to decide if a feature engineering procedure has improved model performance or if we are merely wasting our time.

Here is an example of using Python and scikit-learn to import and calculate the silhouette coefficient for some fake data:

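from sklearn.metrics import silhouette_score

attributes = tabular_data  # placeholder for the numerical feature matrix
cluster_labels = outputted_labels_from_clustering  # placeholder: one cluster label per row

# Returns a single value between -1 and 1; higher means better-separated clusters
silhouette_score(attributes, cluster_labels)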
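
A self-contained version of the same idea might look like the following. The synthetic blobs, the use of k-means to produce the labels, and the parameter values are assumptions chosen only so that the example runs on its own.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three fairly tight groups (an assumption for this sketch)
attributes, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Any clustering algorithm could supply the labels; k-means is used here for simplicity
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(attributes)

# Values near 1 indicate cohesive, well-separated clusters
score = silhouette_score(attributes, cluster_labels)
print(score)
```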

We will spend much more time on unsupervised learning later on in this book as it becomes more relevant. Most of our examples will revolve around predictive analytics/supervised learning.

It is important to remember that the reason we are standardizing algorithms and metrics is so that we may showcase the power of feature engineering and so that you may repeat our procedures with success. Practically, it is conceivable that you are optimizing for something other than accuracy (such as a true positive rate, for example) and wish to use decision trees instead of logistic regression. This is not only fine but encouraged. You should always remember though to follow the steps to evaluating a feature engineering procedure and compare baseline and post-engineering performance.

It is possible that you are not reading this book for the purposes of improving machine learning performance. Feature engineering is useful in other domains such as hypothesis testing and general statistics. In a few examples in this book, we will be taking a look at feature engineering and data transformations as applied to the statistical significance of various statistical tests. We will be exploring metrics such as R² and p-values in order to make judgements about how our procedures are helping.
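
As a small illustration of judging a transformation with R² and a p-value, the sketch below fits a simple linear regression with SciPy before and after a log transformation. The data is synthetic and deliberately constructed so that the response depends on the logarithm of the feature; the variable names and noise level are our own assumptions.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Synthetic data (assumption): the response depends on the logarithm of x
x = np.linspace(1, 1000, 200)
y = np.log(x) + rng.normal(scale=0.1, size=x.size)

# Fit a line to the raw feature and to the log-transformed feature
raw_fit = linregress(x, y)
log_fit = linregress(np.log(x), y)

# R² is the square of the correlation coefficient reported by linregress
r2_raw = raw_fit.rvalue ** 2
r2_log = log_fit.rvalue ** 2

print(f"raw feature:     R^2 = {r2_raw:.3f}, p-value = {raw_fit.pvalue:.3g}")
print(f"log-transformed: R^2 = {r2_log:.3f}, p-value = {log_fit.pvalue:.3g}")
```

On data built this way, the transformed feature explains more of the variance in the response, which is the kind of evidence we would look for before keeping a transformation.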

In general, we will quantify the benefits of feature engineering in the context of three categories:

Supervised learning: Otherwise known as predictive analytics
- Regression analysis—predicting a quantitative variable:
  - Will utilize MSE as our primary metric of measurement
- Classification analysis—predicting a qualitative variable
  - Will utilize accuracy as our primary metric of measurement
Unsupervised learning: Clustering—the assigning of meta-attributes as denoted by the behavior of data:
- Will utilize the silhouette coefficient as our primary metric of measurement
Statistical testing: Using correlation coefficients, t-tests, chi-squared tests, and others to evaluate and quantify the usefulness of our raw and transformed data

In the following few sections, we will look at what will be covered throughout this book.

Feature understanding – what’s in my dataset?

In our first subtopic, we will start to build our fundamentals in dealing with data. By understanding the data in front of us, we can start to have a better idea of where to go next. We will begin to explore the different types of data out there as well as how to recognize the type of data inside datasets. We will look at datasets from several domains and identify how they are different from each other and how they are similar to each other. Once we are able to comfortably examine data and identify the characteristics of different attributes, we can start to understand the types of transformations that are allowed and that promise to improve our machine learning algorithms.

Among the different methods of understanding, we will be looking at:

Structured versus unstructured data
The four levels of data
Identifying missing data values
Exploratory data analysis
Descriptive statistics
Data visualizations

We will begin at a basic level by identifying the structure of, and then the types of, data in front of us. Once we are able to understand what the data is, we can start to fix problems with the data. As an example, we must know how much of our data is missing and what to do when we have missing data.
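
As a small illustration of measuring how much data is missing, here is a sketch assuming pandas is available. The DataFrame and its column names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the columns and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "city": ["Austin", "Boston", None, "Denver"],
    "income": [52000.0, 61000.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()

# Fraction of each column that is missing.
missing_fraction = df.isnull().mean()

print(missing_counts)
print(missing_fraction)
```

Knowing these counts is what lets us decide, column by column, whether to drop, fill, or leave the gaps.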

Make no mistake, data visualizations, descriptive statistics, and exploratory data analysis are all a part of feature engineering. We will be exploring each of these procedures from the perspective of the machine learning engineer. Each of these procedures has the ability to enhance our machine learning pipelines and we will test and alter hypotheses about our data using them.

Feature improvement – cleaning datasets

In this topic, we take the results of our understanding of the data and use them in order to clean the dataset. Much of this book will flow in such a way, using results from previous sections to be able to work on current sections. In feature improvement, our understanding will allow us to begin our first manipulations of datasets. We will be using mathematical transformations to enhance the given data, but we will not remove or insert any attributes (that is for later chapters).

We will explore several topics in this section, including:

Structuring unstructured data
Data imputing: inserting data where there was no data before (missing data)
Normalization of data:
- Standardization (known as z-score normalization)
- Min-max scaling
- L1 and L2 normalization (projecting into different spaces, fun stuff)

By this point in the book, we will be able to identify whether our data has a structure or not. That is, whether our data is in a nice, tabular format. If it is not, this chapter will give us the tools to transform that data into a more tabular format. This is imperative when attempting to create machine learning pipelines.

Data imputing is a particularly interesting topic. The ability to fill in data where data was missing previously is trickier than it sounds. We will be proposing all kinds of solutions, from the very, very easy (merely removing the column altogether; boom, no more missing data) to the interestingly complex (using machine learning on the rest of the features to fill in missing spots). Once we have filled in the bulk of our missing data, we can then measure how that affected our machine learning algorithms.
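
To give a feel for the easy end of that spectrum, here is a minimal sketch contrasting two simple options: dropping every column that contains a missing value, and filling gaps with the column mean. It assumes scikit-learn's SimpleImputer; the array is made up for illustration, and mean imputation is only one of many possible strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: the first two columns have gaps, the third is complete.
X = np.array([
    [1.0, 10.0, 100.0],
    [np.nan, 20.0, 200.0],
    [3.0, np.nan, 300.0],
    [5.0, 30.0, 400.0],
])

# Option 1: remove any column that contains a missing value.
columns_with_missing = np.isnan(X).any(axis=0)
X_dropped = X[:, ~columns_with_missing]

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print(X_dropped.shape)
print(X_imputed)
```

Dropping columns loses two of the three features here, which is why imputing is often worth the extra effort.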

Normalization uses (generally simple) mathematical tools to change the scaling of our data. Again, this ranges from the easy, turning miles into feet or pounds into kilograms, to the more difficult, such as projecting our data onto the unit sphere (more on that to come).
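
The three normalization flavors listed earlier can be sketched as follows, assuming scikit-learn's preprocessing module. The data is invented, and the variable names are ours rather than anything fixed by the library.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two columns on very different scales.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Standardization (z-score): each column ends up with mean 0 and std 1.
X_standardized = StandardScaler().fit_transform(X)

# Min-max scaling: each column is squeezed into the range [0, 1].
X_min_max = MinMaxScaler().fit_transform(X)

# L2 normalization: each row is rescaled to length 1, which places
# every observation on the unit sphere.
X_l2 = Normalizer(norm="l2").fit_transform(X)

print(X_standardized)
print(X_min_max)
print(X_l2)
```

Note that the first two operate column by column, while L2 normalization operates row by row.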

This chapter and the remaining chapters will be much more heavily focused on our quantitative feature engineering procedure evaluation flow. Nearly every single time we look at a new dataset or feature engineering procedure, we will put it to the test. We will be grading the performance of various feature engineering methods on the merits of machine learning performance, speed, and other metrics. This text should only be used as a reference, not as a guide for deciding which feature engineering procedures you are allowed to ignore based on difficulty and change in performance. Every new data task comes with its own caveats and may require different procedures than the previous data task.

Feature selection – say no to bad attributes

By this chapter, we will have a level of comfort when dealing with new datasets. We will have under our belt the abilities to understand and clean the data in front of us. Once we are able to work with the data given to us, we can start to make big decisions such as, at what point is a feature actually an attribute. Recall that by this distinction, feature versus attribute, the question really is, which columns are not helping my ML pipeline and therefore are hurting my pipeline and should be removed? This chapter focuses on techniques used to make the decision of which attributes to get rid of in our dataset. We will explore several statistical and iterative processes that will aid us in this decision.

Among these processes are:

Correlation coefficients
Identifying and removing multicollinearity
Chi-squared tests
ANOVA tests
Interpretation of p-values
Iterative feature selection
Using machine learning to measure entropy and information gain

All of these procedures will attempt to suggest the removal of features and will give different reasons for doing so. Ultimately, it will be up to us, the data scientists, to make the final call over which features will be allowed to remain and contribute to our machine learning algorithms.
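
As one example of how such a suggestion might be produced, here is a sketch that ranks features by the absolute value of their correlation coefficient with the response and keeps those above a cutoff. It assumes pandas; the column names, the data, and the 0.5 threshold are all hypothetical choices for illustration.

```python
import pandas as pd

# Toy dataset: "signal" tracks the target closely, "noise" does not.
df = pd.DataFrame({
    "signal": [1, 2, 3, 4, 5],
    "noise": [3, 1, 2, 1, 3],
    "target": [2.1, 3.9, 6.2, 8.0, 9.9],
})

# Absolute Pearson correlation of each feature with the target.
correlations = df.drop(columns="target").corrwith(df["target"]).abs()

# Arbitrary cutoff chosen for this sketch; in practice we would tune it.
threshold = 0.5
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print(selected)
```

The procedure only recommends dropping the low-correlation column; whether to follow that recommendation remains our decision.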

Feature construction – can we build it?

While in previous chapters we focused heavily on removing features that were not helping us with our machine learning pipelines, this chapter will look at techniques for creating brand new features and placing them correctly within our dataset. These new features will ideally hold new information and generate new patterns that ML pipelines will be able to exploit and use to increase performance.

These created features can come from many places. Oftentimes, we will create new features out of existing features given to us. We can create new features by applying transformations to existing features and placing the resulting vectors alongside their previous counterparts. We will also look at adding new features from third-party systems. As an example, if we are working with data attempting to cluster people based on shopping behaviors, then we might benefit from adding in census data that is separate from the corporation and its purchasing data. However, this will present a few problems:

If the census is aware of 1,700 John Does and the corporation only knows 13, how do we know which of the 1,700 people match up to the 13? This is called entity matching
The census data would be quite large and entity matching would take a very long time

These problems and more make for a fairly difficult procedure but oftentimes create a very dense and data-rich environment.
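
Both sources of new features can be sketched in a few lines, assuming pandas. The tables, column names, and values below are invented for illustration. Note that this sketch joins on a shared zip code, which deliberately sidesteps the entity matching problem described above; matching individual people across systems is much harder.

```python
import pandas as pd

# Hypothetical purchasing data held by the corporation.
purchases = pd.DataFrame({
    "customer": ["ana", "ben", "cy"],
    "zip_code": ["78701", "02108", "78701"],
    "total_spent": [120.0, 80.0, 300.0],
    "num_orders": [4, 2, 10],
})

# New feature built from existing features, placed alongside them.
purchases["avg_order_value"] = purchases["total_spent"] / purchases["num_orders"]

# Hypothetical external data, keyed by zip code rather than by person.
census = pd.DataFrame({
    "zip_code": ["78701", "02108"],
    "median_income": [70000, 90000],
})

# A left join keeps every customer and attaches the outside feature.
enriched = purchases.merge(census, on="zip_code", how="left")

print(enriched)
```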

In this chapter, we will take some time to talk about the manual creation of features through highly unstructured data. Two big examples are text and images. These pieces of data by themselves are incomprehensible to machine learning and artificial intelligence pipelines, so it is up to us to manually create features that represent the images/pieces of text. As a simple example, imagine that we are making the basics of a self-driving car and to start, we want to make a model that can take in an image of what the car is seeing in front of it and decide whether or not it should stop. The raw image is not good enough because a machine learning algorithm would have no idea what to do with it. We have to manually construct features out of it. Given this raw image, we can split it up in a few ways:

We could consider the color intensity of each pixel and consider each pixel an attribute:
- For example, if the camera of the car produces images of 2,048 x 1,536 pixels, we would have 3,145,728 columns
We could consider each row of pixels as an attribute and the average color of each row being the value:
- In this case, there would only be 1,536 columns, one per row of pixels
We could project this image into a space where features represent objects within the image. This is the hardest of the three and would look something like this:

Stop sign	Cat	Sky	Road	Patches of grass	Submarine
1	0	1	1	4	0

Where each feature is an object that may or may not be within the image and the value represents the number of times that object appears in the image. If a model were given this information, it would be a fairly good idea to stop!
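
The first two ways of splitting up the image are easy to sketch with NumPy. The array below is random noise standing in for a grayscale camera frame (an assumption for this example; a color image would carry an extra channel dimension). The third, object-based representation needs a trained model and is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a 2,048 x 1,536 grayscale frame: 1,536 rows of 2,048 pixels.
image = rng.integers(0, 256, size=(1536, 2048), dtype=np.uint8)

# Option 1: every pixel intensity becomes its own attribute.
pixel_features = image.reshape(-1)

# Option 2: every row of pixels becomes one attribute,
# whose value is the average intensity of that row.
row_features = image.mean(axis=1)

print(pixel_features.shape)  # one column per pixel
print(row_features.shape)    # one column per row of pixels
```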

Feature transformation – enter math-man

This chapter is where things get mathematical and interesting. We'll have talked about understanding features and cleaning them. We'll also have looked at how to remove and add new features. In our feature construction chapter, we had to manually create these new features. We, the humans, had to use our brains and come up with those three ways of decomposing that image of a stop sign. Sure, we can create code that makes the features automatically, but we ultimately chose what features we wanted to use.

This chapter will start to look at the automatic creation of these features as it applies to mathematical dimensionality. If we regard our data as vectors in an n-space (n being the number of columns), we will ask ourselves, can we create a new dataset in a k-space (where k < n) that fully or nearly represents the original data, but might give us speed boosts or performance enhancements in machine learning? The goal here is to create a dataset of smaller dimensionality that performs better than our original dataset at a larger dimensionality.

The first question here is, weren't we creating data in smaller dimensionality before when we were feature selecting? If we start with 17 features and remove five, we've reduced the dimensionality to 12, right? Yes, of course! However, we aren't talking simply about removing columns here, we are talking about using complex mathematical transformations (usually taken from our studies in linear algebra) and applying them to our datasets.

One notable example we will spend some time on is called Principal Components Analysis (PCA). It is a transformation that breaks down our data into three different datasets, and we can use these results to create brand new datasets that can outperform our original!

Here is a visual example taken from a Princeton University research experiment that used PCA to exploit patterns in gene expressions. This is a great application of dimensionality reduction as there are so many genes and combinations of genes, it would take even the most sophisticated algorithms in the world plenty of time to process them:

In the preceding screenshot, A represents the original dataset, while U, W, and V^T represent the results of a singular value decomposition. The results are then put together to make a brand new dataset that can replace A to a certain extent.
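
A rough sketch of that decomposition, using NumPy's SVD routine on random data, is shown below. The matrix sizes and the choice of k = 2 are arbitrary assumptions. We mean-center the columns first because PCA is usually applied to centered data; the vector w holds the diagonal of W.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 6 observations in an n-space with n = 4 columns.
A = rng.normal(size=(6, 4))
A_centered = A - A.mean(axis=0)

# Singular value decomposition: A_centered = U @ diag(w) @ Vt.
U, w, Vt = np.linalg.svd(A_centered, full_matrices=False)
reconstructed = U @ np.diag(w) @ Vt

# Keep only the k strongest components to get a dataset in k-space.
k = 2
A_reduced = U[:, :k] * w[:k]

# Map the reduced data back to n-space to see how much was lost.
A_approx = A_reduced @ Vt[:k, :]
approx_error = np.linalg.norm(A_centered - A_approx)

print(A_reduced.shape)
print(approx_error)
```

Here A_reduced is the smaller dataset that could stand in for the original, and approx_error measures what was given up by discarding the weakest components.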

Feature learning – using AI to better our AI

The cherry on top, a cherry powered by the most sophisticated algorithms used today in the automatic construction of features for the betterment of machine learning and AI pipelines.

The previous chapter dealt with automatic feature creation using mathematical formulas, but once again, in the end, it is us, the humans, that choose the formulas and reap the benefits of them. This chapter will outline algorithms that are not in and of themselves a mathematical formula, but an architecture attempting to understand and model data in such a way that it will exploit patterns in data in order to create new data. This may sound vague at the moment, but we hope to get you excited about it!

We will focus mainly on neural algorithms that are specially designed to use a neural network design (nodes and weights). These algorithms will then impose features onto the data in such a way that can sometimes be unintelligible to humans, but extremely useful for machines. Some of the topics we'll look at are:

Restricted Boltzmann machines
Word2Vec/GloVe for word embedding

Word2Vec and GloVe are two ways of adding high-dimensional data to seemingly simple word tokens in the text. For example, if we look at a visual representation of the results of a Word2Vec algorithm, we might see the following:

By representing words as vectors in Euclidean space, we can achieve mathematical-esque results. In the previous example, by adding these automatically generated features we can add and subtract words by adding and subtracting their vector representations as given to us by Word2Vec. We can then generate interesting conclusions, such as king - man + woman = queen. Cool!
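
The arithmetic itself can be illustrated with a toy example. The two-dimensional vectors below are hand-made so that the sum works out exactly; they are not real Word2Vec output, which would typically have hundreds of dimensions and only approximate this behavior. The helper function name is our own.

```python
import numpy as np

# Hand-crafted toy embeddings: first axis ~ "royalty", second ~ "gender".
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Word arithmetic: start at "king", remove "man", add "woman".
result = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the closest remaining word, excluding the words used in the query.
query_words = {"king", "man", "woman"}
candidates = {w: v for w, v in vectors.items() if w not in query_words}
nearest = max(candidates, key=lambda w: cosine_similarity(result, candidates[w]))

print(nearest)
```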

Summary

Feature engineering is a massive task to be undertaken by data scientists and machine learning engineers. It is a task that is imperative to having successful and production-ready machine learning pipelines. In the coming seven chapters, we are going to explore six major aspects of feature engineering:

Feature understanding: learning how to identify data based on its qualities and quantitative state
Feature improvement: cleaning and imputing missing data values in order to maximize the dataset's value
Feature selection: statistically selecting and subsetting feature sets in order to reduce the noise in our data
Feature construction: building new features with the intention of exploiting feature interactions
Feature transformation: extracting latent (hidden) structure within datasets in order to mathematically transform our datasets into something new (and usually better)
Feature learning: harnessing the power of deep learning to view data in a whole new light that will open up new problems to be solved

In this book, we will be exploring feature engineering as it relates to our machine learning endeavors. By breaking down this large topic into our subtopics and diving deep into each one in separate chapters, we will be able to get a much broader and more useful understanding of how these procedures work and how to apply each one in Python.

In our next chapter, we will dive straight into our first subsection, Feature understanding. We will finally be getting our hands on some real data, so let's begin!