You're reading from Machine Learning for Mobile Practical guide to building intelligent mobile applications powered by machine learning

Product type Paperback

Published in Dec 2018

Publisher Packt

ISBN-13 9781788629355

Length 274 pages

Edition 1st Edition

Tools

Android

Concepts

Machine Learning

Authors (2):

Avinash Venkateswarlu

Revathi Gopalakrishnan

The machine learning process

The machine learning process is iterative; it cannot be completed in one go. The most important activities in a machine learning solution are as follows:

1. Define the machine learning problem (it must be well-defined).
2. Gather, prepare, and enhance the data that is required.
3. Use that data to build a model. This step goes in a loop and covers the following substeps. At times, it may also lead to revisiting Step 2 on data or even require the redefinition of the problem statement:
- Select the appropriate model/machine learning algorithm
- Train the machine learning algorithm on the training data and build the model
- Test the model
- Evaluate the results
- Continue this phase until the evaluation result is satisfactory and finalize the model
4. Use the finalized model to make future predictions for the problem statement.

The whole process consists of four major steps and is repeated until the objective is met. The following sections go into the details of each step. The following diagram gives a quick overview of the entire process:

Defining the machine learning problem

As defined by Tom Mitchell, the problem must be a well-defined machine learning problem. The three important questions to be answered at this stage are as follows:

Do we have the right problem?
Do we have the right data?
Do we have the right success criteria?

The problem should be one whose solution is valuable for the business. Sufficient historical data should be available for learning/training purposes. The objective should be measurable, so that we know how much of it has been achieved at any point in time.

For example, if we want to identify fraudulent transactions in a set of online transactions, then detecting them is definitely valuable for the business. We need a sufficiently large set of online transactions, including enough transactions from each of the various fraud categories. We also need a mechanism to verify whether each transaction predicted as fraudulent or nonfraudulent was classified correctly, so that the accuracy of prediction can be measured.

To give readers an idea of how much data is sufficient to implement machine learning, a dataset of at least 100 items should be fine for starters and 1,000 would be nice. The more data we have covering all realistic scenarios for the problem domain, the better it is for the learning algorithm.

Preparing the data

Data preparation is key to the success of the learning solution. Data is the central entity in machine learning, and it must be prepared properly to ensure that the intended results and objectives are obtained.

Data engineers usually spend around 80-90 percent of their overall time in the data preparation phase, as getting the right data is the most fundamental and critical task for a successful machine learning implementation.

The following actions need to be performed in order to prepare the data:

Identify all sources of data: We need to identify all data sources that can solve the problem at hand and collect the data from multiple sources—files, databases, emails, mobile devices, the internet, and so on.
Explore the data: This step involves understanding the nature of the data, as follows:
- Integrate data from different systems and explore it.
- Understand the characteristics and nature of the data.
- Go through the correlations between data entities.
- Identify the outliers. Outliers will help with identifying any problems with the data.
- Apply various statistical principles such as calculating the median, mean, mode, range, and standard deviation to arrive at data skewness. This will help with understanding the nature and spread of data.
- If the data is heavily skewed or the range falls outside the expected boundary, the data may have a problem and we need to revisit its source.
- Visualization of data through graphs will also help with understanding the spread and quality of the data.
Preprocess the data: The goal of this step is to create data in a format that can be used for the next step:
- Data cleansing:
  - Addressing the missing values. A common strategy used to impute missing values is to replace missing values with the mean or median value. It is important to define a strategy for replacing missing values.
  - Addressing duplicate values, invalid data, inconsistent data, outliers, and so on.
- Feature selection: Choosing the data features that are the most appropriate for the problem at hand. Removing redundant or irrelevant features simplifies the process.
- Feature transformation: This phase maps the data from one format to another that helps in proceeding to the next steps of machine learning. It involves normalizing the data, reducing dimensionality, and combining various features into one feature or creating new features. For example, say we have the date and time as attributes. It would be more meaningful to transform them into a day of the week, a day of the month, and a year. Other common transformations include the following:
  - Creating Cartesian products of one variable with another. For example, if we have two variables, such as subject (maths, physics, and commerce) and gender (girls and boys), the features formed by a Cartesian product of these two variables might contain useful information, resulting in features such as maths_girls, physics_girls, commerce_girls, maths_boys, physics_boys, and commerce_boys.
  - Binning numeric variables to categories. For example, the size value of hips/shoulders can be binned to categories such as small, medium, large, and extra large.
  - Creating domain-specific features, for example, combining the subjects maths, physics, and chemistry into a maths group and combining physics, chemistry, and biology into a biology group.
Divide the data into training and test sets: Once the data is transformed, we then need to select the required test set and a training set. An algorithm is evaluated against the test dataset after training it on the training dataset. This split of the data into training and test datasets may be as direct as performing a random split of data (66 percent for training, 34 percent for testing) or it may involve more complicated sampling methods.
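Several of the preparation steps listed above can be sketched with nothing but the Python standard library. The following is a minimal illustration, not a prescribed implementation: the function names, the sample values, and the size cut-offs are all assumptions made for this example, and the skewness figure uses Pearson's second coefficient as one simple option among several.

```python
import statistics
from datetime import datetime
from itertools import product

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def summarize(values):
    """Descriptive statistics for a numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    median = statistics.median(present)
    stdev = statistics.stdev(present)  # sample standard deviation
    return {
        "mean": mean,
        "median": median,
        "range": max(present) - min(present),
        "stdev": stdev,
        # Pearson's second skewness coefficient; positive hints at a long right tail.
        "skew": 3 * (mean - median) / stdev if stdev else 0.0,
    }


def impute_median(values):
    """Replace missing values with the median of the observed ones."""
    median = statistics.median([v for v in values if v is not None])
    return [median if v is None else v for v in values]


def date_features(timestamp):
    """Split a datetime into day of the week, day of the month, and year."""
    return {
        "day_of_week": WEEKDAYS[timestamp.weekday()],
        "day_of_month": timestamp.day,
        "year": timestamp.year,
    }


def bin_size(measurement_cm, small_max=90, medium_max=100, large_max=110):
    """Bin a numeric measurement into a size category (cut-offs are made up)."""
    if measurement_cm <= small_max:
        return "small"
    if measurement_cm <= medium_max:
        return "medium"
    if measurement_cm <= large_max:
        return "large"
    return "extra large"


def cross_features(first, second):
    """Cartesian product of two categorical variables, joined as 'first_second'."""
    return [f"{a}_{b}" for b, a in product(second, first)]


ages = [22, 25, None, 31, 90]  # illustrative column with one gap and one outlier
stats = summarize(ages)
cleaned = impute_median(ages)
features = date_features(datetime(2018, 12, 31, 9, 30))
crossed = cross_features(["maths", "physics", "commerce"], ["girls", "boys"])
```

Here the median is used for imputation because a single large value (90) pulls the mean well above the typical age, which is also what the positive skew value signals.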

The 66 percent/34 percent split is just a guide. If you have 1 million pieces of data, a 90 percent/10 percent split should be enough. With 100 million pieces of data, you can even go down to 99 percent/1 percent.
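A random split with an adjustable ratio might look like the sketch below. The function name `train_test_split`, its default fraction, and the fixed seed are assumptions chosen for illustration; the seed only makes the shuffle reproducible.

```python
import random


def train_test_split(rows, train_fraction=0.66, seed=42):
    """Shuffle the rows and cut them into a training part and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


small_train, small_test = train_test_split(range(100))                      # 66/34
large_train, large_test = train_test_split(range(1000), train_fraction=0.9)  # 90/10
```

Every row ends up in exactly one of the two sets, so the test set stays unseen during training.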

A trained model is not exposed to the test dataset during training and any predictions made on that dataset are designed to be indicative of the performance of the model in general. As such, we need to make sure the selection of datasets is representative of the problem that we are solving.
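One way to keep both sets representative is stratified sampling, which splits each class separately so that rare classes appear in both sets in roughly the same proportion. This technique is offered here as an example of a "more complicated sampling method"; it is not taken from the text, and the function name, the `label_of` parameter, and the toy transactions are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.66, seed=42):
    """Split each label group separately so class proportions are preserved."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: 10 fraudulent and 90 legitimate transactions (made up).
transactions = [(i, "fraud") for i in range(10)] + [(i, "legit") for i in range(10, 100)]
train_rows, test_rows = stratified_split(
    transactions, label_of=lambda row: row[1], train_fraction=0.8
)
```

With a plain random split, a small test set could by chance contain no fraudulent rows at all; splitting per class rules that out.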

Building the model

The model-building phase consists of many substeps, as indicated earlier: selecting an appropriate machine learning algorithm, training the model, testing it, and evaluating it to determine whether the objectives have been achieved. If they have not, we enter the retraining phase, either with the same algorithm and different datasets or with an entirely new algorithm, until the objectives are reached.

Selecting the right machine learning algorithm

The first step toward building the model is to select the right machine learning algorithm that might solve the problem.

This step involves building a model with the selected algorithm and training it on the training set. The algorithm learns patterns in the training data that map the input variables to the target, and it outputs a model that captures these relationships. The model can then be used to get predictions on new data for which the target answer is not known.

Training the machine learning model

The goal is to select the most appropriate algorithm for building the machine learning model, train it, and then analyze the results. We begin by selecting appropriate machine learning techniques to analyze the data. The next chapter, Chapter 2, Random Forest on iOS, discusses the different machine learning algorithms and the types of problems for which each is apt.

The training process and the analysis of the results also vary based on the algorithms selected for training.

The training phase usually uses all the attributes (features) present in the transformed data, including the predictor attributes as well as the objective attributes.

Testing the model

Once the machine learning algorithm is trained on the training data, the next step is to run the model on the test data.

The entire set of attributes or features of the data is divided into predictor attributes and objective attributes. During testing, only the predictor attributes of the test set are fed as input to the model, which uses them to predict the objective attributes. The predicted output is then compared against the actual data to understand the quality of the output from the algorithm.
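The difference between training (predictor and objective attributes together) and testing (predictor attributes only, with the objective held back for comparison) can be illustrated with a deliberately simple classifier. The nearest-centroid rule below is only a stand-in chosen because it fits in a few lines; the attribute values and labels are invented for the example.

```python
from collections import defaultdict


def train(rows):
    """Learn one centroid per label from (predictors, objective) pairs."""
    sums, counts = {}, defaultdict(int)
    for predictors, label in rows:
        if label not in sums:
            sums[label] = [0.0] * len(predictors)
        for i, value in enumerate(predictors):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(model, predictors):
    """Return the label whose centroid is closest to the given predictors."""
    def squared_distance(centroid):
        return sum((c - p) ** 2 for c, p in zip(centroid, predictors))
    return min(model, key=lambda label: squared_distance(model[label]))


def evaluate(model, rows):
    """Predict from predictors only, then compare with the held-back objective."""
    predictions = [predict(model, predictors) for predictors, _ in rows]
    correct = sum(p == actual for p, (_, actual) in zip(predictions, rows))
    return predictions, correct / len(rows)


# Predictors: [amount, items in basket]; objective: fraud or legit (made-up data).
training_set = [([20.0, 1.0], "legit"), ([35.0, 2.0], "legit"),
                ([900.0, 9.0], "fraud"), ([1100.0, 11.0], "fraud")]
test_set = [([25.0, 1.0], "legit"), ([950.0, 8.0], "fraud"), ([700.0, 5.0], "legit")]

model = train(training_set)
predictions, accuracy = evaluate(model, test_set)
```

The third test row is a legitimate but unusually large transaction, so this simple model misclassifies it; that mismatch is exactly what the comparison against actual data is meant to surface.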

The results should be properly presented for further analysis. What to present in the results and how to present them are critical. They may also bring to the fore new business problems.

Evaluation of the model

There should be a process to test machine learning algorithms and discover whether or not we have chosen the right algorithms, and to validate the output the algorithm provides against the problem statement.

This is the last step of model building, where we check the accuracy against the threshold defined in the success criteria. If the accuracy is greater than or equal to the threshold, we are done. If not, we need to start over with a different machine learning algorithm, different parameter settings, more data, or different data transformations. The entire machine learning process can be repeated, or only a subset of its steps. These are repeated until we meet the definition of "done" and are satisfied with the results.

The machine learning process is highly iterative. Findings from one step may require a previous step to be repeated with new information. For example, during the data transformation step, we may find data quality issues that require us to go back and acquire more data from a different source.

Each step may also require several iterations of its own; data preparation and model selection, in particular, often do. Any activity in the sequence can be repeated any number of times. For example, it is common to try different machine learning algorithms to train the model before moving on to testing it. So, it is important to recognize that this is a highly iterative process and not a linear one.
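The loop of trying an algorithm, evaluating it against the success threshold, and moving on to another one if it falls short can be sketched as follows. The two candidate "algorithms" (a majority-class baseline and a single threshold rule), the function names, and the data are all assumptions invented for this illustration.

```python
import statistics
from collections import Counter


def fit_majority(train_rows):
    """Baseline: always predict the most frequent label in the training data."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda amount: most_common


def fit_threshold(train_rows):
    """Predict 'fraud' above the midpoint between the two class means."""
    fraud_mean = statistics.mean([a for a, label in train_rows if label == "fraud"])
    legit_mean = statistics.mean([a for a, label in train_rows if label == "legit"])
    midpoint = (fraud_mean + legit_mean) / 2
    return lambda amount: "fraud" if amount > midpoint else "legit"


def build_until_satisfactory(candidates, train_rows, test_rows, threshold):
    """Try each candidate in turn; stop at the first that meets the threshold."""
    history = []
    for name, fit in candidates:
        predict = fit(train_rows)
        correct = sum(predict(x) == y for x, y in test_rows)
        accuracy = correct / len(test_rows)
        history.append((name, accuracy))
        if accuracy >= threshold:
            return name, history
    # No candidate was good enough: revisit the data or the problem definition.
    return None, history


train_rows = [(20, "legit"), (35, "legit"), (50, "legit"), (900, "fraud"), (1100, "fraud")]
test_rows = [(30, "legit"), (40, "legit"), (950, "fraud"), (1000, "fraud")]
candidates = [("majority", fit_majority), ("threshold_rule", fit_threshold)]

chosen, history = build_until_satisfactory(candidates, train_rows, test_rows, threshold=0.9)
```

Returning `None` when nothing clears the bar makes the "go back to an earlier step" outcome explicit instead of silently accepting the least bad model.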

Test set creation: We have to define the test dataset clearly. The goals of the test dataset are as follows:

Quickly and consistently test the algorithm that has been selected to solve the problem
Test a variety of algorithms to determine whether they are able to solve the problem
Determine which algorithm would be worth using to solve the problem
Determine whether there is a problem with the data used for evaluation: if all algorithms consistently fail to produce proper results, the data itself might need to be revisited

Performance measure: The performance measure is a way to evaluate the model created. Different machine learning models need different performance metrics. There are standard performance measures from which we can choose, so there may be no need to customize them for our model.

The following are some of the important terms that need to be known to understand the performance measure of algorithms:

Overfitting: The machine learning model is overfitting the training data when we see that the model performs well on the training data but does not perform well on the evaluation data. This is because the model is memorizing the data it has seen and is unable to generalize to unseen examples.
Underfitting: The machine learning model is underfitting the training data when the model performs poorly on the training data. This is because the model is unable to capture the relationship between the input examples (often called X) and the target values (often called Y).
Cross-validation: Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equally sized subsamples. Each subsample is used once as the test set while the remaining k-1 subsamples form the training set, and the k results are averaged.
Confusion matrix: In the field of machine learning, and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm.
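Bias: Bias is the tendency of a model to consistently make predictions that deviate from the true relationship between the parameters and the labels, typically because the model is too simple.
Variance: Variance is the tendency of a model's predictions to change when it is trained on different samples of the training data, typically because the model is too sensitive to the particular examples it has seen.
Accuracy: The number of correct results divided by the total number of results.
Error: The number of incorrect results divided by the total number of results.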

Precision: The number of correct results returned by a machine learning algorithm divided by the number of all returned results.
Recall: The number of correct results returned by a machine learning algorithm divided by the number of results that should have been returned.
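Several of the terms above reduce to simple counting, which the following sketch makes concrete for a two-class problem. It derives accuracy, error, precision, and recall from the four cells of a confusion matrix, and it shows one way to generate the index partitions for k-fold cross-validation. The function names and the sample labels are assumptions made for this example.

```python
import random


def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard measures derived from the confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, test_indices) pairs; each index is tested once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsamples
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
folds = list(k_fold_indices(10, 5))
```

A large gap between the score on training folds and the score on held-out folds is the usual symptom of overfitting, while poor scores on both point to underfitting.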

Making predictions/Deploying in the field

Once the model is ready, it can be deployed in the field. The deployed model can then be used to make predictions on new incoming data.
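As a rough illustration of this hand-off, the sketch below serializes the parameters of a finalized model, restores them as a deployed application would, and predicts on new values. JSON is used here only because it is simple to show; the model format, field names, and functions are hypothetical and do not represent any particular mobile framework.

```python
import json


def export_model(model):
    """Serialize the finalized model parameters for shipping with an app."""
    return json.dumps(model, sort_keys=True)


def load_model(serialized):
    """Restore the model parameters on the device."""
    return json.loads(serialized)


def predict(model, amount):
    """Apply the deployed threshold rule to a new, unseen transaction amount."""
    if amount > model["threshold"]:
        return model["positive_label"]
    return model["negative_label"]


# Hypothetical finalized model: a single learned threshold.
trained = {"threshold": 517.5, "positive_label": "fraud", "negative_label": "legit"}
payload = export_model(trained)      # done once, after model building
deployed = load_model(payload)       # done in the field
predictions = [predict(deployed, amount) for amount in (42.0, 980.0)]
```

Training happens once, offline; only the much cheaper prediction step runs where the model is deployed.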