Machine Learning with Swift

Getting Started with Machine Learning

We live in exciting times. Artificial intelligence (AI) and machine learning (ML) have gone from obscure mathematical and science fiction topics to being a part of mass culture. Google, Facebook, Microsoft, and others compete to be the first to give the world general AI. In November 2015, Google open sourced its ML framework, TensorFlow, which is suitable for running on supercomputers as well as smartphones, and which has since won a broad community. Shortly afterwards, other big companies followed the example. The best iOS app of 2016 (Apple Choice), the viral photo editor Prisma, owes its success entirely to a particular kind of ML algorithm: the convolutional neural network (CNN). These systems were invented back in the nineties but became popular only in the 2010s. Mobile devices gained enough computational power to run them only in 2014/2015. In fact, artificial neural networks became so important for practical applications that in iOS 10 Apple added native support for them in the Metal and Accelerate frameworks. Apple also opened Siri to third-party developers and introduced GameplayKit, a framework to add AI capabilities to your computer games. In iOS 11, Apple introduced Core ML, a framework for running pre-trained models on Apple devices, and the Vision framework for common computer vision tasks.

The best time to start learning about ML was 10 years ago. The next best time is right now.

In this chapter, we will cover the following topics:

Understanding what AI and ML are
Fundamental concepts of ML: model, dataset, and learning
Types of ML tasks
ML project life cycle
General purpose ML versus mobile ML

The motivation behind ML

Let's start with an analogy. There are two ways of learning an unfamiliar language:

Learning the language rules by heart, using textbooks, dictionaries, and so on. That's how college students usually do it.
Observing live language: by communicating with native speakers, reading books, and watching movies. That's how children do it.

In both cases, you build in your mind the language model, or, as some prefer to say, develop a sense of language.

In the first case, you are trying to build a logical system based on rules. In this case, you will encounter many problems: the exceptions to the rule, different dialects, borrowing from other languages, idioms, and lots more. Someone else, not you, derived and described for you the rules and structure of the language.

In the second case, you derive the same rules from the available data. You may not even be aware of the existence of these rules, but gradually adjust yourself to the hidden structure and understand the laws. You use your special brain cells called mirror neurons, trying to mimic native speakers. This ability is honed by millions of years of evolution. After some time, when facing the wrong word usage, you just feel that something is wrong but you can't tell immediately what exactly.

In any case, the next step is to apply the resulting language model in the real world. Results may differ. In the first case, you will experience difficulty every time you find the missing hyphen or comma, but may be able to get a job as a proofreader at a publishing house. In the second case, everything will depend on the quality, diversity, and amount of the data on which you were trained. Just imagine a person in the center of New York who studied English through Shakespeare. Would he be able to have a normal conversation with people around him?

Now we'll put the computer in place of the person in our example. The two approaches, in this case, represent two programming techniques. The first one corresponds to writing ad hoc algorithms consisting of conditions, loops, and so on, by which a programmer expresses rules and structures. The second one represents ML, in which case the computer itself identifies the underlying structure and rules based on the available data.

The analogy is deeper than it seems at first glance. For many tasks, building the algorithms directly is impossibly hard because of the variability in the real world. It may require the work of experts in the domain, who must describe all rules and edge cases explicitly. The resulting models can be fragile and rigid. On the other hand, the same task can be solved by allowing computers to figure out the rules on their own from a reasonable amount of data. An example of such a task is face recognition. It's virtually impossible to formalize face recognition in terms of conventional imperative algorithms and data structures. Only recently was the task successfully solved with the help of ML.

What is ML?

ML is a subdomain of AI that has demonstrated significant progress over the last decade and remains a hot research topic. It is a branch of knowledge concerned with building algorithms that can learn from data and improve themselves with regard to the tasks they perform. ML allows computers to deduce the algorithm for some task or to extract hidden patterns from data. ML is known by several different names in different research communities: predictive analytics, data mining, statistical learning, pattern recognition, and so on. One can argue that these terms have some subtle differences, but essentially, they all overlap to the extent that you can use the terminology interchangeably.

The abbreviation ML may refer to many things outside of the AI domain; for example, there is a functional programming language of this name. Nevertheless, the abbreviation is widely used in the names of libraries and conferences to refer to machine learning. Throughout this book, we also use it in this way.

ML is already everywhere around us. Search engines, targeted ads, face and voice recognition, recommender systems, spam filtering, self-driving cars, fraud detection in bank systems, credit scoring, automated video captioning, and machine translation: all these things are impossible to imagine without ML these days.

Over recent years, ML has owed its success to several factors:

The abundance of data in different forms (big data)
Accessible computational power and specialized hardware (clouds and GPUs)
The rise of open source and open access
Algorithmic advances

Any ML system includes three essential components: data, model, and task. The data is something you provide as an input to your model. A model is a type of mathematical function or computer program that performs the task. For instance, your emails are data, the spam filter is a model, and telling spam apart from non-spam is a task. The learning in ML stands for the process of adjusting your model to the data so that the model becomes better at its task. The obvious consequence of this setup is expressed in a piece of wisdom well known among statisticians: "Your model is only as good as your data".

Applications of ML

There are many domains where ML is an indispensable ingredient; some of them are robotics, bioinformatics, and recommender systems. While nothing prevents you from writing bioinformatics software in Swift for macOS or Linux, we will restrict our practical examples in this book to more mobile-friendly domains. The apparent reason for this is that currently, iOS remains the primary target platform for most of the programmers who use Swift on a day-to-day basis.

For the sake of convenience, we'll roughly divide all ML applications of interest for mobile developers into three plus one areas, according to the datatypes they deal with most commonly:

Digital signal processing (sensor data, audio)
Computer vision (images, video)
Natural language processing (texts, speech)
Other applications and datatypes

Digital signal processing (DSP)

This category includes tasks where input data types are signals, time series, and audio. The sources of the data are sensors, HealthKit, microphone, wearable devices (for example, Apple Watch, or brain-computer interfaces), and IoT devices. Examples of ML problems here include:

Motion sensor data classification for activity recognition
Speech recognition and synthesis
Music recognition and synthesis
Biological signals (ECG, EEG, and hand tremor) analysis

We will build a motion recognition app in Chapter 3, K-Nearest Neighbors Classifier.

Strictly speaking, image processing is also a subdomain of DSP but let's not be too meticulous here.

Computer vision

Everything related to images and videos falls into this category. We will develop some computer vision apps in Chapter 9, Convolutional Neural Networks. Examples of computer vision tasks are:

Optical character recognition (OCR) and handwritten input
Face detection and recognition
Image and video captioning
Image segmentation
3D-scene reconstruction
Generative art (artistic style transfer, Deep Dream, and so on)

Natural language processing (NLP)

NLP is a branch of knowledge at the intersection of linguistics, computer science, and statistics. We'll talk about the most common NLP techniques in Chapter 10, Natural Language Processing. Applications of NLP include the following:

Automated translation, spelling, grammar, and style correction
Sentiment analysis
Spam detection/filtering
Document categorization
Chatbots and question answering systems

Other applications of ML

You can come up with many more applications that are hard to categorize. ML can be done on virtually any data if you have enough of it. Some peculiar data types are:

Spatial data: GPS location (Chapter 4, K-Means Clustering), coordinates of UI objects and touches
Tree-like structures: hierarchy of folders and files
Network-like data: occurrences of people together in your photos, or hyperlinks between web pages
Application logs and user in-app activity data (Chapter 5, Association Rule Learning)
System data: free disk space, battery level, and similar
Survey results

Using ML to build smarter iOS applications

As we know from press reports, Apple uses ML for fraud detection and to mine useful data from beta testing reports; however, these are not examples visible on our mobile devices. Your iPhone itself has a handful of ML models built into its operating system and some native apps, helping to perform a wide range of tasks. Some use cases are well known and prominent while others are inconspicuous. The most obvious examples are Siri's speech recognition, natural language understanding, and voice generation. The Camera app uses face detection for focusing, and the Photos app uses face recognition to group photos with the same person into one album. Presenting the new iOS 10 in June 2016, Craig Federighi mentioned its predictive keyboard, which uses an LSTM algorithm (a type of recurrent neural network) to suggest the next word from the context, and also how Photos uses deep learning to recognize objects and classify scenes. iOS itself uses ML to extend battery life, provide contextual suggestions, match profiles from social networks and mail with the records in Contacts, and choose between internet connection options. On Apple Watch, ML models are employed to recognize user motion activity types and handwritten input.

Prior to iOS 10, Apple provided some ML APIs, like speech or movement recognition, but only as black boxes, without the possibility to tune the models or to reuse them for other purposes. If you wanted to do something slightly different, like detect a type of motion that is not predefined by Apple, you had to build your own models from scratch. In iOS 10, CNN building blocks were added to two frameworks at once: as a part of the Metal API, and as a sublibrary of the Accelerate framework. Also, the first actual ML algorithm was introduced to the iOS SDK: the decision tree learner in GameplayKit.

ML capabilities continued to expand with the release of iOS 11. At WWDC 2017, Apple presented the Core ML framework. It includes an API for running pre-trained models and is accompanied by tools for converting models trained with some popular ML frameworks to Apple's own format. Still, for now it doesn't provide the possibility of training models on a device, so your models can't be changed or updated at runtime.

Looking in the App Store for the terms artificial intelligence, deep learning, ML, and similar, you'll find a lot of applications, some of them quite successful. Here are several examples:

Google Translate is doing speech recognition and synthesis, OCR, handwriting recognition, and automated translation; some of this is done offline, and some online.
Duolingo validates pronunciation, recommends optimal study materials, and employs chatbots for language study.
Prisma, Artisto, and others turn photos into paintings using a neural artistic style transfer algorithm. Snapchat and Fabby use image segmentation, object tracking, and other computer vision techniques to enhance selfies. There are also applications for coloring black and white photos automatically.
Snapchat's video selfie filters use ML for real-time face tracking and modification.
Aipoly Vision helps blind people, saying aloud what it sees through the camera.
Several calorie counter apps recognize food through a camera. There are also similar apps to identify dog breeds, trees and trademarks.
Dozens of AI personal assistants and chatbots, with capabilities ranging from cow disease diagnostics to matchmaking and stock trading.
Predictive keyboards, spellcheckers, and auto correction, for instance, SwiftKey.
Games that learn from their users and games with evolving characters/units.
There are also news, mail, and other apps that adapt to users' habits and preferences using ML.
Brain-computer interfaces and fitness wearables use ML to recognize different user conditions, like concentration, sleep phases, and so on. At least some of their companion mobile apps do ML.
Medical diagnostics and monitoring through mobile health applications. For example, OneRing monitors Parkinson's disease using the data from a wearable device.

All these applications are built upon extensive data collection and processing. Even if the application itself is not collecting the data, the model it uses was trained on some, usually big, dataset. In the following section, we will discuss all things related to data in ML applications.

Getting to know your data

For many years, researchers argued about what is more important: data or algorithms. But now, it looks like the importance of data over algorithms is generally accepted among ML specialists. In most cases, we can assume that the one who has better data beats those with more advanced algorithms. Garbage in, garbage out: this rule holds true in ML more than anywhere else. To succeed in this domain, you need not only to have data, but also to know your data and know what to do with it.

ML datasets are usually composed of individual observations, called samples, cases, or data points. In the simplest case, each sample has several features.

Features

When we are talking about features in the context of ML, what we mean is some characteristic property of the object or phenomenon we are investigating.

Other names for the same concept you'll see in some publications are explanatory variable, independent variable, and predictor.

Features are used to distinguish objects from each other and to measure the similarity between them.

For instance:

If the objects of our interest are books, features could be a title, page count, author's name, a year of publication, genre, and so on
If the objects of interest are images, features could be intensities of each pixel
If the objects are blog posts, features could be language, length, or presence of some terms

It's useful to imagine your data as a spreadsheet table. In this case, each sample (data point) would be a row, and each feature would be a column. For example, Table 1.1 shows a tiny dataset of books consisting of four samples where each has eight features.

Table 1.1: an example of an ML dataset (dummy books):

Title	Author's name	Pages	Year	Genre	Average readers review score	Publisher	In stock
Learn ML in 21 Days	Machine Learner	354	2018	Sci-Fi	3.9	Untitled United	False
101 Tips to Survive an Asteroid Impact	Enrique Drills	124	2021	Self-help	4.7	Vacuum Books	True
Sleeping on the Keyboard	Jessica's Cat	458	2014	Non-fiction	3.5	JhGJgh Inc.	True
Quantum Screwdriver: Heritage	Yessenia Purnima	1550	2018	Sci-Fi	4.2	Vacuum Books	True

Types of features

In the books example, you can see several types of features:

Categorical or unordered: Title, author, genre, publisher. They are similar to enumeration without raw values in Swift, but with one difference: they have levels instead of cases. Important: you can't order them or say that one is bigger than another.
Binary: The presence or absence of something, just true or false. In our case, the In stock feature.
Real numbers: Page count, year, average readers' review score. In Swift, these can be represented as Float or Double.

There are others, but these are by far the most common.
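To make the three feature types more tangible, here is a minimal sketch of how one row of Table 1.1 could be modeled in Swift. The type and property names (Genre, Book, and so on) are invented for this illustration and do not come from any framework.

```swift
// Hypothetical types for illustration only.

// A categorical feature: a fixed set of levels with no ordering between them.
enum Genre {
    case sciFi, selfHelp, nonFiction
}

// One sample (row) of the dummy books dataset.
struct Book {
    let title: String          // categorical
    let author: String         // categorical
    let pages: Double          // real number
    let year: Double           // real number
    let genre: Genre           // categorical
    let averageScore: Double   // real number
    let publisher: String      // categorical
    let inStock: Bool          // binary
}

let book = Book(title: "Learn ML in 21 Days",
                author: "Machine Learner",
                pages: 354, year: 2018,
                genre: .sciFi,
                averageScore: 3.9,
                publisher: "Untitled United",
                inStock: false)

// Levels can be compared for equality, but there is no meaningful "less than".
print(book.genre == .sciFi)   // true
```

Because Genre does not conform to Comparable, the compiler rejects an expression such as book.genre < .selfHelp, which mirrors the rule that categorical levels cannot be ordered.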

The most common ML algorithms require the dataset to consist of a number of samples, where each sample is represented by a vector of real numbers (feature vector), and all samples have the same number of features. The simplest (but not the best) way of translating categorical features into real numbers is by replacing them with numerical codes (Table 1.2).

Table 1.2: dummy books dataset after simple preprocessing:

Title	Author's name	Pages	Year	Genre	Average readers review score	Publisher	In stock
0.0	0.0	354.0	2018.0	0.0	3.9	0.0	0.0
1.0	1.0	124.0	2021.0	1.0	4.7	1.0	1.0
2.0	2.0	458.0	2014.0	2.0	3.5	2.0	1.0
3.0	3.0	1550.0	2018.0	0.0	4.2	1.0	1.0

This is an example of how your dataset may look before you feed it into your ML algorithm. Later, we will discuss the nuts and bolts of data preprocessing for specific applications.
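The simple preprocessing behind Table 1.2 can be sketched in a few lines. The function name encodeCategories is an assumption made for this example; it assigns codes in order of first appearance, which happens to reproduce the Genre and Publisher columns of the table.

```swift
// Replaces each distinct value in a column with a numeric code,
// assigned in order of first appearance (0.0, 1.0, 2.0, ...).
func encodeCategories(_ column: [String]) -> [Double] {
    var codes: [String: Double] = [:]
    return column.map { value -> Double in
        if let code = codes[value] { return code }
        let code = Double(codes.count)
        codes[value] = code
        return code
    }
}

let genres = ["Sci-Fi", "Self-help", "Non-fiction", "Sci-Fi"]
let publishers = ["Untitled United", "Vacuum Books", "JhGJgh Inc.", "Vacuum Books"]
let inStock = [false, true, true, true]

print(encodeCategories(genres))         // [0.0, 1.0, 2.0, 0.0]
print(encodeCategories(publishers))     // [0.0, 1.0, 2.0, 1.0]
print(inStock.map { $0 ? 1.0 : 0.0 })   // [0.0, 1.0, 1.0, 1.0]
```

Note that the codes introduce an order (Non-fiction becomes "bigger" than Sci-Fi) that the original categories do not have, which is one reason this is the simplest but not the best encoding.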

Choosing a good set of features

For ML purposes, it's necessary to choose a reasonable set of features, not too many and not too few:

If you have too few features, this information may not be sufficient for your model to achieve the required quality. In this case, you want to construct new ones from existing features, or extract more features from the raw data.
If you have too many features, you want to select only the most informative and discriminative ones, because the more features you have, the more complex your computations become.

How do you tell which features are most important? Sometimes common sense helps. For example, if you are building a model that recommends books for you, the genre and average rating of the book are perhaps more important features than the number of pages and year of publication. But what if your features are just pixels of a picture and you're building a face recognition system? For a black and white image of size 1024 x 768, we'd get 786,432 features. Which pixels are most important? In this case, you have to apply some algorithms to extract meaningful features. For example, in computer vision, edges, corners, and blobs are more informative features than raw pixels, so there are plenty of algorithms to extract them (Figure 1.1). By passing your image through some filters, you can get rid of unimportant information and reduce the number of features significantly: from hundreds of thousands to hundreds, or even tens. Techniques that help to select the most important subset of existing features are known as feature selection, while feature extraction techniques result in the creation of new features:

Figure 1.1: Edge detection is a common feature extraction technique in computer vision. You can still recognize the object on the right image, despite it containing significantly less information than the left one.
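The idea behind Figure 1.1 can be imitated on a toy scale. The following sketch uses a plain difference between neighboring pixels, which is far cruder than the edge detectors used in real computer vision code; the function name horizontalEdges and the tiny image are assumptions made for this illustration.

```swift
// Absolute difference between horizontally adjacent pixels.
// Large values mark vertical edges; flat regions become zero.
func horizontalEdges(_ image: [[Double]]) -> [[Double]] {
    return image.map { row in
        zip(row, row.dropFirst()).map { abs($1 - $0) }
    }
}

// A tiny 3 x 5 grayscale image: dark on the left, bright on the right.
let image: [[Double]] = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

let edges = horizontalEdges(image)
print(edges[0])   // [0.0, 0.0, 1.0, 0.0]
```

Almost all of the output is zero, and the only nonzero column marks where the dark region meets the bright one: less data, but the informative part is preserved.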

Feature extraction, selection, and combination is a kind of art known as feature engineering. It requires not only hacking and statistical skills but also domain knowledge. We will see some feature engineering techniques while working on practical applications in the following chapters. We will also step into the exciting world of deep learning: a technique that gives a computer the ability to extract high-level abstract features from low-level features.

The number of features you have for each sample (or the length of the feature vector) is usually referred to as the dimensionality of the problem. Many problems are high-dimensional, with hundreds or even thousands of features. Even worse, some of those problems are sparse; that is, for each data point, most of the features are zero or missing. This is a common situation in recommender systems. For instance, imagine yourself building a dataset of movie ratings: the rows are movies, the columns are users, and in each cell you have the rating given to the movie by the user. The majority of the cells in the table will remain empty, as most of the users will never have watched most of the movies. The opposite situation, when most values are in place, is called dense. Many problems in natural language processing and bioinformatics are high-dimensional, sparse, or both.
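One way to picture a sparse dataset is to store only the cells that actually hold a value. The sketch below does this for the movie ratings example; the types MovieUser and SparseRatings and the table sizes are hypothetical and chosen only for illustration.

```swift
// Hypothetical sparse storage: only ratings that exist are kept.
struct MovieUser: Hashable {
    let movie: Int
    let user: Int
}

struct SparseRatings {
    let movieCount: Int
    let userCount: Int
    private(set) var ratings: [MovieUser: Double] = [:]

    init(movieCount: Int, userCount: Int) {
        self.movieCount = movieCount
        self.userCount = userCount
    }

    mutating func rate(movie: Int, user: Int, score: Double) {
        ratings[MovieUser(movie: movie, user: user)] = score
    }

    // nil means "this user has not rated this movie".
    func rating(movie: Int, user: Int) -> Double? {
        return ratings[MovieUser(movie: movie, user: user)]
    }

    // Fraction of cells that actually hold a value.
    var density: Double {
        return Double(ratings.count) / Double(movieCount * userCount)
    }
}

var table = SparseRatings(movieCount: 1000, userCount: 5000)
table.rate(movie: 3, user: 42, score: 4.5)
table.rate(movie: 7, user: 42, score: 2.0)
print(table.density)   // 2 of 5,000,000 cells are filled
```

A dense two-dimensional array of the same size would hold five million numbers, nearly all of them placeholders, whereas the dictionary grows only with the number of ratings actually given.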

Feature selection and extraction help to decrease the number of features without significant loss of information, so we also call them dimensionality reduction algorithms.

Getting the dataset

Datasets can be obtained from different sources. The ones important for us are:

Classical datasets such as Iris (botanical measurements of flowers composed by R. Fisher in 1936), MNIST (70,000 images of handwritten digits, 60,000 of them for training, published in 1998), Titanic (personal information of Titanic passengers from Encyclopedia Titanica and other sources), and others. Many classical datasets are available as part of Python and R ML packages. They represent some classical types of ML tasks and are useful for demonstrations of algorithms. So far, there is no similar library for Swift. Implementing such a library would be straightforward and is low-hanging fruit for anyone who wants to get some stars on GitHub.
Open and commercial dataset repositories. Many institutions release their data for everyone's needs under different licenses. You can use such data for training production models or while collecting your own dataset.

Some public dataset repositories include:

- The UCI ML repository: https://archive.ics.uci.edu/ml/datasets.html
- Kaggle datasets: https://www.kaggle.com/datasets
- data.world, a social network for dataset sharing: https://data.world

To find more, visit the list of repositories at KDnuggets: http://www.kdnuggets.com/datasets/index.html. Alternatively, you'll find a list of datasets at Wikipedia: https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research.

Data collection (acquisition) is required if no existing data can help you to solve your problem. This approach can be costly both in resources and time if you have to collect the data ad hoc; however, in many cases, you have data as a byproduct of some other process, and you can compose your dataset by extracting useful information from that data. For example, text corpora can be composed by crawling Wikipedia or news sites. iOS automatically collects some useful data. HealthKit is a unified database of users' health measurements. Core Motion allows getting historical data on the user's motion activities. The ResearchKit framework provides standardized routines to assess the user's health conditions. The CareKit framework standardizes the polls. Also, in some cases, useful information can be obtained from app log mining.
- In many cases, collecting data is not enough, as raw data doesn't suit many ML tasks well. So, the next step after data collection is data labeling. For example, you have collected a dataset of images, so now you have to attach a label to each of them: to which category does this image belong? This can be done manually (often at considerable expense), automatically (sometimes impossible), or semi-automatically. Manual labeling can be scaled by means of crowdsourcing platforms, like Amazon Mechanical Turk.
Random data generation can be useful for a quick check of your ideas or in combination with the TDD approach. Also, sometimes adding some controlled randomness to your real data can improve the results of learning. This approach is known as data augmentation. For instance, this approach was taken to build an optical character recognition feature in the Google Translate mobile app. To train their model, they needed a lot of real-world photos with letters in different languages, which they didn't have. The engineering team bypassed this problem by creating a large dataset of letters with artificial reflections, smudges, and all kinds of corruptions on them. This improved the recognition quality significantly.
Real-time data sources, such as inertial sensors, GPS, camera, microphone, elevation sensor, proximity sensor, touch screen, force touch, and Apple Watch sensors can be used to collect a standalone dataset or to train a model on the fly.

Real-time data sources are especially important for the special class of ML models called online ML, which allows models to incorporate new data as it arrives. A good example of such a situation is spam filtering, where the model should dynamically adapt to the new data. It's the opposite of batch learning, in which the whole training dataset must be available from the very beginning.
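The difference between the two modes can be shown with a deliberately trivial "model": an estimate of the mean of a stream of numbers. This is only a sketch of the update pattern, not a real learning algorithm, and the names RunningMean and batchMean are made up for the example.

```swift
// Online version: updated one observation at a time, never stores past data.
struct RunningMean {
    private(set) var count = 0
    private(set) var mean = 0.0

    mutating func update(with value: Double) {
        count += 1
        mean += (value - mean) / Double(count)
    }
}

// Batch version for comparison: needs the whole dataset up front.
func batchMean(_ values: [Double]) -> Double {
    return values.reduce(0, +) / Double(values.count)
}

let stream = [2.0, 4.0, 6.0, 8.0]
var online = RunningMean()
for value in stream {
    online.update(with: value)   // the estimate is usable after every step
}
print(online.mean, batchMean(stream))   // 5.0 5.0
```

Both versions arrive at the same answer here, but the online one can keep adapting as new values come from a sensor, while the batch one has to be recomputed from the full dataset.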

Data preprocessing

The useful information in the data is usually referred to as signal. On the other hand, the pieces of data that represent errors of different kinds and irrelevant data are known as noise. Errors can occur in the data during measurements, information transmission, or due to human mistakes. The goal of data cleansing procedures is to increase the signal-to-noise ratio. During this stage, you will usually transform all data to one format, delete entries with missing values, and check suspicious outliers (they can be both noise and signal). It is widely believed among ML engineers that the data preprocessing stage usually consumes 90% of the time allocated for the ML project. Then, algorithm tweaking consumes another 90% of the time. This statement is a joke only partially (about 10% of it). In Chapter 13, Best Practices, we are going to discuss common problems with the data and how to fix them.
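Two of the cleansing steps mentioned above, deleting entries with missing values and checking suspicious outliers, might look roughly like this. The function name clean, the threshold of two standard deviations, and the sample readings are all assumptions made for this sketch; real projects usually need more careful, domain-specific rules.

```swift
// Removes missing entries, then separates values that lie more than
// `threshold` standard deviations from the mean as suspected outliers.
func clean(_ raw: [Double?], threshold: Double = 2.0) -> (kept: [Double], outliers: [Double]) {
    let present = raw.compactMap { $0 }   // drop missing values
    guard !present.isEmpty else { return (kept: [], outliers: []) }

    let mean = present.reduce(0, +) / Double(present.count)
    let squaredDeviations = present.map { ($0 - mean) * ($0 - mean) }
    let deviation = (squaredDeviations.reduce(0, +) / Double(present.count)).squareRoot()
    guard deviation > 0 else { return (kept: present, outliers: []) }

    let isOutlier: (Double) -> Bool = { abs($0 - mean) / deviation > threshold }
    return (kept: present.filter { !isOutlier($0) }, outliers: present.filter(isOutlier))
}

// Made-up body temperature readings with one gap and one odd value.
let readings: [Double?] = [36.6, 36.8, nil, 37.0, 36.7, 98.6, 36.9]
let result = clean(readings)
print(result.kept)       // [36.6, 36.8, 37.0, 36.7, 36.9]
print(result.outliers)   // [98.6]
```

The flagged value is only a suspect. In this invented example it looks like a Fahrenheit reading mixed into Celsius data, so it may be signal in the wrong format rather than noise, which is why outliers are returned for inspection instead of being silently discarded.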

Choosing a model

Let's say you've defined a task and you have a dataset. What's next? Now you need to choose a model and train it on the dataset to perform that task.

The model is the central concept in ML. ML is basically a science of building models of the real world using data. A model relates to the phenomenon being modeled as a map relates to the real territory. Depending on the situation, it can play the role of a good approximation, an outdated description (in a swiftly changing environment), or even a self-fulfilling prophecy (if the model affects the modeled object). "All models are wrong, but some are useful" is a well-known saying in statistics.

Types of ML algorithms

ML models/algorithms are often divided into three groups depending on the type of input:

Supervised learning
Unsupervised learning
Reinforcement learning

This division is rather vague because some algorithms fall into two of these groups while others do not fall into any. There are also some middle states, such as semi-supervised learning.

Algorithms in these three groups can perform different tasks, and hence can be divided into subgroups according to the output of the model. Table 1.3 shows the most common ML tasks and their classification.

Supervised learning

Supervised learning is arguably the most common and easy-to-understand type of ML. All supervised learning algorithms have one prerequisite in common: you should have a labeled dataset to train them. Here, a dataset is a set of samples, plus an expected output (label) for each sample. These labels play the role of supervisor during the training.

In different publications, you'll see different synonyms for labels, including dependent variable, predicted variable, and explained variable.

The goal of supervised learning is to get a function that returns the desired output for every given input. In the most simplified version, a supervised learning process consists of two phases: training and inference. During the first phase, you train the model using your labeled dataset. In the second phase, you use your model to do something useful, like make predictions. For instance, given a set of labeled images (dataset), a neural network (model) can be trained to predict (inference) correct labels for previously unseen images.

Using supervised learning, you will usually solve one of two problems: classification or regression. The difference is in the type of labels: categorical in the first case and real numbers in the second.

To classify simply means to assign one of the labels from a predefined set. Binary classification is a special kind of classification in which you have only two labels (positive and negative). An example of a classification task is assigning spam/not-spam labels to emails. We will train our first classifier in the next chapter, and throughout this book we will apply different classifiers to many real-world tasks.
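The two phases of supervised learning, training and inference, can be sketched with about the simplest classifier imaginable: one that memorizes labeled samples and answers with the label of the closest one. The types, the two features (counts of exclamation marks and links in an email), and the data are all invented for this illustration.

```swift
// A labeled sample: a feature vector plus the expected output.
struct LabeledSample {
    let features: [Double]
    let label: String
}

// Hypothetical one-nearest-neighbor classifier, used only to show the two phases.
struct NearestNeighborClassifier {
    private var memory: [LabeledSample] = []

    // Training phase: this model simply memorizes the labeled dataset.
    mutating func train(on dataset: [LabeledSample]) {
        memory = dataset
    }

    // Inference phase: return the label of the closest memorized sample.
    func predict(_ features: [Double]) -> String? {
        func squaredDistance(_ a: [Double], _ b: [Double]) -> Double {
            return zip(a, b).map { ($0 - $1) * ($0 - $1) }.reduce(0, +)
        }
        let nearest = memory.min(by: {
            squaredDistance($0.features, features) < squaredDistance($1.features, features)
        })
        return nearest?.label
    }
}

// Features: [exclamation marks, links] in an email (made-up numbers).
let dataset = [
    LabeledSample(features: [8, 5], label: "spam"),
    LabeledSample(features: [6, 7], label: "spam"),
    LabeledSample(features: [0, 1], label: "not-spam"),
    LabeledSample(features: [1, 0], label: "not-spam"),
]

var classifier = NearestNeighborClassifier()
classifier.train(on: dataset)
print(classifier.predict([7, 6]) ?? "unknown")   // spam
print(classifier.predict([1, 1]) ?? "unknown")   // not-spam
```

The labels are categorical and come from a predefined set of two, so this is a binary classification task.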

Regression is the task of assigning a real number to a given case. An example is predicting a salary from an employee's characteristics. We will discuss regression in more detail in Chapter 6, Linear Regression and Gradient Descent, and Chapter 7, Linear Classifier and Logistic Regression.
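For contrast with classification, here is a minimal regression sketch: fitting a straight line to a handful of points with the closed-form least-squares formulas. The function name fitLine and the salary figures are made up for this example, and a single feature stands in for the full set of employee characteristics.

```swift
// Fits y ≈ slope * x + intercept by ordinary least squares (closed form).
// Assumes the x values are not all identical.
func fitLine(x: [Double], y: [Double]) -> (slope: Double, intercept: Double) {
    precondition(x.count == y.count && x.count > 1, "need matching data with at least two points")
    let n = Double(x.count)
    let meanX = x.reduce(0, +) / n
    let meanY = y.reduce(0, +) / n
    // Sums of products of deviations from the means.
    let sxy = zip(x, y).map { ($0 - meanX) * ($1 - meanY) }.reduce(0, +)
    let sxx = x.map { ($0 - meanX) * ($0 - meanX) }.reduce(0, +)
    let slope = sxy / sxx
    return (slope: slope, intercept: meanY - slope * meanX)
}

// Made-up data: years of experience versus salary (in thousands).
let experience = [1.0, 2.0, 3.0, 4.0]
let salary = [40.0, 50.0, 60.0, 70.0]

let model = fitLine(x: experience, y: salary)       // training
print(model.slope * 5 + model.intercept)            // inference for 5 years: 80.0
```

The output is a real number rather than one of a fixed set of labels, which is exactly what separates regression from classification.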

If the task is to sort objects in some order (to output a permutation, combinatorially speaking), and the labels are not real numbers but rather an order of objects, you are dealing with learning to rank. You see ranking algorithms in action when you open the Siri suggestions menu on iOS. Each app in the list is placed there according to its relevance for you.

If labels are complicated objects, like graphs or trees, neither classification nor regression will be of use. Structured prediction algorithms are the type of algorithms to tackle those problems. Parsing English sentences into syntactic trees is an example of this kind of task.

Ranking and structured learning are beyond the scope of this book because their use cases are not as common as classification or regression, but at least now you know what to Google search for when you need to.

Unsupervised learning

In unsupervised learning, you don't have the labels for the cases in your dataset. Types of tasks to solve with unsupervised learning are: clustering, anomaly detection, dimensionality reduction, and association rule learning.

Sometimes you don't have the labels for your data points but you still want to group them in some meaningful way. You may or may not know the exact number of groups. This is the setting where clustering algorithms are used. The most obvious example is clustering users into some groups, like students, parents, gamers, and so on. The important detail here is that a group's meaning is not predefined from the very beginning; you name it only after you've finished grouping your samples. Clustering also can be useful to extract additional features from the data as a preliminary step for supervised learning. We will discuss clustering in Chapter 4, K-Means Clustering.

Outlier/anomaly detection algorithms are used when the goal is to find some anomalous patterns in the data, weird data points. This can be especially useful for automated fraud or intrusion detection. Outlier analysis is also an important detail of data cleansing.

Dimensionality reduction is a way to distill data to its most informative and, at the same time, most compact representation. The goal is to reduce the number of features without losing important information. It can be used as a preprocessing step before supervised learning or data visualization.

Association rule learning looks for repeated patterns of user behavior and peculiar co-occurrences of items. An example from retail practice: if a customer buys milk, isn't it more probable that he will also buy cereal? If yes, then perhaps it's better to move the shelf with the cereals closer to the shelf with the milk. Having rules like this, business owners can make informed decisions and adapt their services to customers' needs. In the context of software development, this can empower anticipatory design, in which the app seemingly knows what you want to do next and provides suggestions accordingly. In Chapter 5, Association Rule Learning, we will implement Apriori, one of the most well-known rule learning algorithms.
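The milk-and-cereal question can be checked on a toy dataset with two standard measures, support and confidence, which this chapter has not introduced yet. The sketch below only counts co-occurrences; it is not the rule learning algorithm itself, and the baskets are invented.

```swift
// Fraction of transactions that contain every item in `items`.
func support(of items: Set<String>, in transactions: [Set<String>]) -> Double {
    let hits = transactions.filter { items.isSubset(of: $0) }.count
    return Double(hits) / Double(transactions.count)
}

// How often `consequent` appears in transactions that contain `antecedent`.
func confidence(from antecedent: Set<String>, to consequent: Set<String>,
                in transactions: [Set<String>]) -> Double {
    let both = support(of: antecedent.union(consequent), in: transactions)
    let first = support(of: antecedent, in: transactions)
    return both / first
}

// Made-up shopping baskets.
let baskets: [Set<String>] = [
    ["milk", "cereal"],
    ["milk", "cereal", "bread"],
    ["milk", "bread"],
    ["bread", "butter"],
]

print(support(of: ["milk"], in: baskets))                        // 0.75
print(confidence(from: ["milk"], to: ["cereal"], in: baskets))   // 2 of 3 milk baskets
```

In this invented data, two of the three baskets with milk also contain cereal, so the rule "milk implies cereal" holds with a confidence of about 0.67.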

Figure 1.2: Datasets for three types of learning: supervised, unsupervised, and semi-supervised

Labeling data manually is usually costly, especially if special qualifications are required. Semi-supervised learning can help when only some of your samples are labeled and others are not (see Figure 1.2). It is a hybrid of supervised and unsupervised learning. First, it looks, in an unsupervised manner, for unlabeled instances that are similar to the labeled ones and includes them in the training dataset. After this, the algorithm can be trained on the expanded dataset in a typical supervised manner.
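The first step of that process can be sketched as pseudo-labeling: each unlabeled sample borrows the label of its nearest labeled neighbor, provided the neighbor is close enough. Using plain distance as the similarity measure is an assumption of this sketch, as are the `Sample` type, the `expandLabels` function, the threshold, and the data.

```swift
// Pseudo-labeling sketch: copy the label of the nearest labeled sample onto each
// unlabeled sample that lies close enough, then train on the expanded set.
// The types, the distance threshold, and the data are illustrative assumptions.
struct Sample {
    let feature: Double
    let label: String?   // nil means "not labeled yet"
}

func expandLabels(_ samples: [Sample], maxDistance: Double) -> [Sample] {
    let labeled = samples.filter { $0.label != nil }
    return samples.map { (sample: Sample) -> Sample in
        guard sample.label == nil,
              let nearest = labeled.min(by: {
                  abs($0.feature - sample.feature) < abs($1.feature - sample.feature)
              }),
              abs(nearest.feature - sample.feature) <= maxDistance
        else { return sample }
        return Sample(feature: sample.feature, label: nearest.label)
    }
}

let samples = [
    Sample(feature: 1.0, label: "short"),
    Sample(feature: 1.2, label: nil),
    Sample(feature: 9.0, label: "long"),
    Sample(feature: 8.7, label: nil),
    Sample(feature: 5.0, label: nil),   // too far from both: stays unlabeled
]
let expanded = expandLabels(samples, maxDistance: 1.0)
let trainingSet = expanded.filter { $0.label != nil }
print(trainingSet.map { "\($0.feature): \($0.label!)" })
```

The resulting `trainingSet` has four labeled samples instead of two and can be handed to any supervised learner.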

Reinforcement learning

Reinforcement learning is special in the sense that it doesn't require a dataset (see Figure 1.3). Instead, it involves an agent that takes actions, changing the state of the environment. After each step, the agent gets a reward or punishment, depending on the state and previous actions. The goal is to obtain the maximum cumulative reward. It can be used to teach the computer to play video games or drive a car. If you think about it, reinforcement learning is the way our pets train us humans: by rewarding our actions with tail-wagging, or punishing with scratched furniture.

One of the central topics in reinforcement learning is the exploration-exploitation dilemma—how to find a good balance between exploring new options and using what is already known:

Figure 1.3: Reinforcement learning process
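A common, simple answer to the exploration-exploitation dilemma mentioned above is the epsilon-greedy rule: with a small probability the agent tries a random action, and otherwise it picks the action that currently looks best. The sketch below shows only this selection step; the function name and the reward estimates are assumptions, and a real agent would also update its estimates after each reward.

```swift
// Epsilon-greedy action selection: explore with probability `epsilon`, otherwise exploit.
// The function name and the reward estimates are illustrative assumptions.
func chooseAction<G: RandomNumberGenerator>(estimatedRewards: [Double],
                                            epsilon: Double,
                                            using generator: inout G) -> Int {
    precondition(!estimatedRewards.isEmpty, "At least one action is required")
    if Double.random(in: 0..<1, using: &generator) < epsilon {
        // Explore: try any action, even one that currently looks bad.
        return Int.random(in: 0..<estimatedRewards.count, using: &generator)
    }
    // Exploit: pick the action with the best estimate so far.
    return estimatedRewards.indices.max(by: { estimatedRewards[$0] < estimatedRewards[$1] })!
}

var generator = SystemRandomNumberGenerator()
let estimates = [0.2, 0.8, 0.5]
// With epsilon = 0 the agent never explores and always picks action 1.
let greedyChoice = chooseAction(estimatedRewards: estimates, epsilon: 0.0, using: &generator)
// With epsilon = 0.1 it usually picks action 1 but occasionally tries something else.
let curiousChoice = chooseAction(estimatedRewards: estimates, epsilon: 0.1, using: &generator)
print(greedyChoice, curiousChoice)
```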

Table 1.3: ML tasks

Task	Output type	Problem example	Algorithms
Supervised learning
Regression	Real numbers	Predict a house's price, given its characteristics	Linear regression and polynomial regression
Classification	Categorical	Spam/not-spam classification	KNN, Naïve Bayes, logistic regression, decision trees, random forest, and SVM
Ranking	Natural number (ordinal variable)	Sort search results by relevance	Ordinal regression
Structured prediction	Structures: trees, graphs, and so on	Part-of-speech tagging	Recurrent neural networks and conditional random fields
Unsupervised learning
Clustering	Groups of objects	Build a tree of living organisms	Hierarchical clustering, k-means, and GMM
Dimensionality reduction	Compact representation of given features	Find most important components in brain activity	PCA, t-SNE, and LDA
Outlier/anomaly detection	Objects that are out of pattern	Fraud detection	Local outlier factor
Association rule learning	Set of rules	Smart house intrusion detection	Apriori
Reinforcement learning
Control learning	Policy with maximum expected return	Learn to play a video game	Q-learning

Mathematical optimization – how learning works

The magic behind the learning process is delivered by the branch of mathematics called mathematical optimization. Somewhat misleadingly, it is sometimes also referred to as mathematical programming; the term was coined long before computer programming became widespread and is not directly related to it. Optimization is the science of choosing the best option among available alternatives; for example, choosing the best ML model.

Mathematically speaking, ML models are functions. You as an engineer choose the function family depending on your preferences: linear models, trees, neural networks, support vector machines, and so on. Learning is the process of picking from that family the function that serves your goals best. This notion of the best model is often defined by another function, the loss function. It estimates how good the model is according to some criteria; for instance, how well the model fits the data, how complex it is, and so on. You can think of the loss function as a judge at a competition whose role is to assess the models. The objective of learning is to find the model that minimizes the loss function, so the whole learning process is formalized in mathematical terms as a function minimization task.
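Here is a minimal sketch of a loss function acting as the judge. It assumes a deliberately tiny function family, y = w * x, and uses mean squared error as the loss; the data points and the candidate weights are invented for the example.

```swift
// Mean squared error as a loss function that "judges" candidate models.
// The family y = w * x, the data, and the candidate weights are illustrative assumptions.
let xs: [Double] = [1, 2, 3, 4]
let ys: [Double] = [2, 4, 6, 8]

// Lower is better: 0 means the model reproduces every data point exactly.
func meanSquaredError(weight: Double) -> Double {
    let squaredErrors = zip(xs, ys).map { x, y in (weight * x - y) * (weight * x - y) }
    return squaredErrors.reduce(0, +) / Double(xs.count)
}

// "Learning" here is just picking the candidate with the smallest loss.
let candidateWeights: [Double] = [0.5, 1.0, 2.0, 3.0]
let bestWeight = candidateWeights.min(by: {
    meanSquaredError(weight: $0) < meanSquaredError(weight: $1)
})!
print(bestWeight)                          // 2.0
print(meanSquaredError(weight: 1.0))       // 7.5
```

Real learners do not enumerate a handful of candidates like this; they search the family systematically.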

A function's minimum can be found in two ways: analytically (using calculus) or numerically (using iterative methods). In ML, we often go for numerical optimization because the loss functions get too complex for analytical solutions.

A nice interactive tutorial on numerical optimization can be found here: http://www.benfrederickson.com/numerical-optimization/.

From the programmer's point of view, learning is an iterative process of adjusting model parameters until the optimal solution is found. In practice, after a number of iterations, the algorithm stops improving because it is stuck in a local optimum or has reached the global optimum (see the following diagram). If the algorithm always finds the local or global optimum, we say that it converges. On the other hand, if you see your algorithm oscillating more and more and never approaching a useful result, it diverges:

Figure 1.4: Learner represented as a ball on a complex surface: it can fall into a local minimum and never reach the global one
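The iterative adjustment of parameters, along with convergence and divergence, can be sketched with gradient descent on a one-parameter loss. The loss L(w) = (w - 3)^2, the starting point, and the two learning rates are assumptions picked for illustration. This loss has a single minimum, so the sketch does not show the local-minimum trap from Figure 1.4.

```swift
// Gradient descent on the one-parameter loss L(w) = (w - 3)^2, whose minimum is at w = 3.
// The loss, the starting point, and the learning rates are illustrative assumptions.
func gradientDescent(start: Double, learningRate: Double, steps: Int) -> Double {
    var w = start
    for _ in 0..<steps {
        let gradient = 2 * (w - 3)      // dL/dw
        w -= learningRate * gradient    // step against the gradient
    }
    return w
}

// A small learning rate shrinks the distance to the minimum on every step.
let convergedWeight = gradientDescent(start: 0, learningRate: 0.1, steps: 100)
// A learning rate that is too large overshoots further on every step.
let divergedWeight = gradientDescent(start: 0, learningRate: 1.1, steps: 100)
print(convergedWeight)  // very close to 3
print(divergedWeight)   // enormous in magnitude, nowhere near 3
```

With a learning rate of 0.1, each step multiplies the distance to the minimum by 0.8; with 1.1, each step multiplies it by -1.2, which is exactly the growing oscillation described above.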

Mobile versus server-side ML

Most Swift developers write their applications for iOS. Those among us who develop Swift applications for macOS or the server side are in a lucky position regarding ML. They can use whatever libraries and tools they want, relying on powerful hardware and compatibility with interpreted languages. Most ML libraries and frameworks are developed with the server side (or at least powerful desktops) in mind. In this book, we talk mostly about iOS applications, and therefore most practical examples take the limitations of handheld devices into account.

But if mobile devices have limited capabilities, we can do all ML on the server-side, can't we? Why would anyone bother to do ML locally on mobile devices at all? There are at least three issues with client-server architecture:

The client app will be fully functional only when it has an internet connection. This may not be a big problem in developed countries, but it can limit your target audience significantly. Just imagine your translator app being non-functional while you travel abroad.
Additional time delay introduced by sending data to the server and getting a response. Who enjoys watching progress bars or, even worse, infinite spinners while your data is being uploaded, processed, and downloaded back again? What if you need those results immediately and without consuming your internet traffic? Client-server architecture makes ML applications such as real-time video and audio processing almost impossible.
Privacy concerns: any data you've uploaded to the internet is not yours anymore. In the age of total surveillance, how do you know that those funny selfies you've uploaded today to the cloud will not be used tomorrow to train face recognition, or for target-tracking algorithms for some interesting purposes, like killer drones? Many users don't like their personal information to be uploaded to some servers and possibly shared/sold/leaked to some third parties. Apple also argues for reducing data collection as much as possible.

Some of the applications can be OK (can't be great, though) with those limitations, but most developers want their apps to be responsive, secure, and useful all the time. This is something only on-device ML can deliver.

For me, the most important argument is that we can do ML without a server side. Hardware capabilities increase every year, and ML on mobile devices is a hot research field. Modern mobile devices are already powerful enough for many ML algorithms. Smartphones are the most personal and arguably the most important devices nowadays, simply because they are everywhere. Coding ML is fun and cool, so why should server-side developers have all the fun?

Additional bonuses that you get when you implement ML on the mobile side are free computation power (you are not paying for the electricity) and unique marketing points (our app puts the power of AI in your pocket).

Understanding mobile platform limitations

Now, if I have persuaded you to use ML on mobile devices, you should be aware of some limitations:

Computation complexity restriction. The more you load your CPU, the faster your battery will die. It's easy to transform your iPhone into a compact heater with the help of some ML algorithms.
Some models take a long time to train. On the server, you can let your neural networks train for weeks; but on a mobile device, even minutes are too long. iOS applications can run and process some data in background mode if they have some good reasons, like playing music. Unfortunately, ML is not on the list of good reasons, so most probably, you will not be able to run it in background mode.
Some models take a long time to run. You should think in terms of frames per second and good user experience.
Memory restrictions. Some models grow during the training process, while others remain a fixed size.
Model size restrictions. Some trained models can take hundreds of megabytes or even gigabytes. But who wants to download your application from the App Store if it is so huge?
Locally stored data is mostly restricted to different types of users' personal data, meaning that you will not be able to aggregate the data of different users and perform large-scale ML on mobile devices.
Many open source ML libraries are built on top of interpreted languages, like Python, R, and MATLAB, or on top of the JVM, which makes them incompatible with iOS.

Those are only the most obvious challenges. You'll see more as we start to develop real ML apps. But don't worry, there is a way to eat this elephant piece by piece. The effort you spend on it pays off in a great user experience and users' love. Platform restrictions are not unique to mobile devices. Developers of autonomous devices (like drones), IoT developers, wearable device developers, and many others face the same problems and deal with them successfully.

Many of these problems can be addressed by training the models on powerful hardware, and then deploying them to mobile devices. You can also choose a compromise with two models: a smaller one on a device for offline work, and a large one on the server. For offline work you can choose models with fast inference, then compress and optimize them for parallel execution; for instance, on GPU. We'll talk more about this in Chapter 12, Optimizing Neural Networks for Mobile Devices.
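To get a feel for why compression matters for the model size restriction listed earlier, a back-of-the-envelope estimate multiplies the number of parameters by the bytes used to store each one. The parameter count below is hypothetical, and storing weights in 8 bits instead of 32 is just one assumed compression option; real model files also carry metadata, so treat the numbers as rough.

```swift
// Rough model size: parameter count times bytes per parameter.
// The parameter count and the 8-bit option are hypothetical; real files also carry metadata.
func approximateSizeInMegabytes(parameterCount: Int, bytesPerParameter: Int) -> Double {
    return Double(parameterCount * bytesPerParameter) / 1_000_000
}

let parameterCount = 25_000_000
// 32-bit floating-point weights: 4 bytes each.
let float32Size = approximateSizeInMegabytes(parameterCount: parameterCount, bytesPerParameter: 4)
// 8-bit weights: 1 byte each.
let int8Size = approximateSizeInMegabytes(parameterCount: parameterCount, bytesPerParameter: 1)
print(float32Size, int8Size)  // 100.0 25.0
```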
