Effective Amazon Machine Learning

Introduction to Machine Learning and Predictive Analytics

As artificial intelligence and big data have become a ubiquitous part of our everyday lives, cloud-based machine learning services are part of a rising billion-dollar industry. Among the several services currently available on the market, Amazon Machine Learning stands out for its simplicity. Amazon Machine Learning was launched in April 2015 with a clear goal of lowering the barrier to predictive analytics by offering a service accessible to companies without the need for highly skilled technical resources.

This introductory chapter is a general presentation of the Amazon Machine Learning service and the types of predictive analytics problems it can solve. The Amazon Machine Learning platform distinguishes itself by its simplicity and straightforwardness. However, simplicity often implies that hard choices have been made. We explain what was sacrificed, why these choices make sense, and how the resulting simplicity can be extended with other services in the rich data-focused AWS ecosystem.

We explore what types of predictive analytics projects the Amazon Machine Learning platform can address and how it uses a simple linear model for regression and classification problems. Before starting a predictive analytics project, it is important to understand what context is appropriate and what constitutes good results. We present the context for successful predictions with Amazon Machine Learning (Amazon ML).

The reader will understand what sort of problems Amazon ML can address and the assumptions with regard to the underlying data. We show how Amazon ML solves linear regression and classification problems with a simple linear model and why that makes sense. Finally, we present the limitations of the platform.

This chapter addresses the following topics:

What is Machine Learning as a Service (MLaaS) and why does it matter?
How Amazon ML successfully leverages linear regression, a simple and powerful model
What is predictive analytics and what types of regression and classification problems can it address?
The necessary conditions the data must verify to obtain reliable predictions
What's missing from the Amazon ML service?

Introducing Amazon Machine Learning

In the emerging MLaaS industry, Amazon ML stands out on several fronts. Its simplicity, combined with the power of the AWS ecosystem, lowers the barrier to entry in machine learning for companies while balancing performance and cost.

Machine Learning as a Service

Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics.

Launched in April 2015 at the AWS summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google Prediction, IBM Watson, PredictionIO, BigML, and many others. These online machine learning services form an offering commonly referred to as Machine Learning as a Service or MLaaS, following the naming pattern of other cloud-based services such as SaaS, PaaS, and IaaS, which stand for Software, Platform, and Infrastructure as a Service respectively.

Studies show that MLaaS is a potentially big business trend. ABI Research, a business intelligence consultancy, estimates that revenues from machine learning-based data analytics tools and services will reach nearly $20 billion in 2021 as MLaaS services take off, as outlined in this business report: http://iotbusinessnews.com/2016/08/01/39715-machine-learning-iot-enterprises-spikes-advent-machine-learning-service-models/

Eugenio Pasqua, Research Analyst at ABI Research, said the following:

"The emergence of the Machine-Learning-as-a-Service (MLaaS) model is good news for the market, as it cuts down the complexity and time required to implement machine learning and thus opens the doors to an increase in its adoption level, especially in the small-to-medium business sector."

The increased accessibility is a direct result of using an API-based infrastructure to build machine-learning models instead of developing applications from scratch. Offering efficient predictive analytics models without the need to code, host, and maintain complex code bases lowers the bar and makes ML available to smaller businesses and institutions.

Amazon ML takes this democratization approach further than the other actors in the field by significantly simplifying the predictive analytics process and its implementation. This simplification revolves around four design decisions that are embedded in the platform:

A limited set of tasks: binary classification, multiclass classification, and regression
A single linear algorithm
A limited choice of metrics to assess the quality of the prediction
A simple set of tuning parameters for the underlying predictive algorithm

That somewhat constrained environment is simple enough while addressing most predictive analytics problems relevant to business. It can be leveraged across an array of different industries and use cases.

Leveraging full AWS integration

The AWS data ecosystem of pipelines, storage, environments, and Artificial Intelligence (AI) is also a strong argument in favor of choosing Amazon ML as a business platform for its predictive analytics needs. Although Amazon ML is simple, the service evolves to greater complexity and more powerful features once it is integrated in a larger structure of AWS data related services.

AWS is already a major actor in cloud computing. Here's what an excerpt from The Economist, August 2016 has to say about AWS (http://www.economist.com/news/business/21705849-how-open-source-software-and-cloud-computing-have-set-up-it-industry):

AWS shows no sign of slowing its progress towards full dominance of cloud computing's wide skies. It has ten times as much computing capacity as the next 14 cloud providers combined, according to Gartner, a consulting firm. AWS's sales in the past quarter were about three times the size of its closest competitor, Microsoft's Azure.

This gives an edge to Amazon ML, as many companies that are using cloud services are likely to be already using AWS. Adding simple and efficient machine learning tools to the product offering mix anticipates the rise of predictive analytics features as a standard component of web services. Seamless integration with other AWS services is a strong argument in favor of using Amazon ML despite its apparent simplicity.

The following architecture is a case study taken from an AWS January 2016 white paper titled Big Data Analytics Options on AWS (http://d0.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf), showing a potential AWS architecture for sentiment analysis on social media. It shows how Amazon ML can be part of a more complex architecture of AWS services:

Comparing performances

Keeping systems and applications simple is always difficult, but often worth it for the business. Examples abound with overloaded UIs bringing down the user experience, while products with simple, elegant interfaces and minimal features enjoy widespread popularity. The Keep It Simple mantra is even more difficult to adhere to in a context such as predictive analytics where performance is key. This is the challenge Amazon took on with its Amazon ML service.

A typical predictive analytics project is a sequence of complex operations: getting the data, cleaning the data, selecting, optimizing, and validating a model, and finally making predictions. In the scripting approach, data scientists develop codebases using machine learning libraries such as the Python scikit-learn library or R packages to handle all these steps, from data gathering to predictions in production. Just as a developer breaks down the necessary steps into modules for maintainability and testability, Amazon ML breaks down a predictive analytics project into different entities: datasource, model, evaluation, and predictions. It is the simplicity of each of these steps that makes AWS so effective for implementing successful predictive analytics projects.

Engineering data versus model variety

Having a large choice of algorithms for your predictions is always a good thing, but at the end of the day, domain knowledge and the ability to extract meaningful features from clean data is often what wins the game.

Kaggle is a well-known platform for predictive analytics competitions, where the best data scientists across the world compete to make predictions on complex datasets. In these predictive competitions, gaining a few decimals on your prediction score is what makes the difference between earning the prize or being just an extra line on the public leaderboard among thousands of other competitors. One thing Kagglers quickly learn is that choosing and tuning the model is only half the battle. Feature extraction or how to extract relevant predictors from the dataset is often the key to winning the competition.

In real life, when working on business-related problems, the quality of the data processing phase and the ability to extract meaningful signal out of raw data is the most important and time-consuming part of building an efficient predictive model. It is well known that "data preparation accounts for about 80% of the work of data scientists" (http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/). Model selection and algorithm optimization remain an important part of the work but are often not the deciding factor where implementation is concerned.

A solid and robust implementation that is easy to maintain and connects to your ecosystem seamlessly is often preferred to an overly complex model developed and coded in-house, especially when the scripted model only produces small gains when compared to a service based implementation.

Amazon's expertise and the gradient descent algorithm

Amazon has been using machine learning for the retail side of its business and has built serious expertise in predictive analytics. This expertise translates into the choice of algorithm powering the Amazon ML service.

The Stochastic Gradient Descent (SGD) algorithm is the algorithm powering Amazon ML linear models and is ultimately responsible for the accuracy of the predictions generated by the service. The SGD algorithm is one of the most robust, resilient, and optimized algorithms. It has been used with great success since the 1960s in many diverse environments, from signal processing to deep learning, and for a wide variety of problems. The SGD has also given rise to many highly efficient variants adapted to a wide variety of data contexts. We will come back to this important algorithm in a later chapter; suffice it to say at this point that the SGD algorithm is the Swiss army knife of predictive analytics algorithms.

Several benchmarks and tests of the Amazon ML service can be found across the web (Amazon, Google and Azure: https://blog.onliquid.com/machine-learning-services-2/ and Amazon versus scikit-learn: http://lenguyenthedat.com/minimal-data-science-2-avazu/). Overall results show that the Amazon ML performance is on a par with other MLaaS platforms, but also with scripted solutions based on popular machine learning libraries such as scikit-learn.

For a given problem in a specific context and with an available dataset and a particular choice of a scoring metric, it is probably possible to code a predictive model using an adequate library and obtain better performances than the ones obtained with Amazon ML. But what Amazon ML offers is stability, absence of coding, and a very solid benchmark record, as well as a seamless integration with the Amazon Web Services ecosystem that already powers a large portion of the Internet.

Pricing

As with other MLaaS providers and AWS services, Amazon ML only charges for what you consume.

The cost is broken down into the following:

An hourly rate for the computing time used to build predictive models
A prediction fee per thousand prediction samples
And in the context of real-time (streaming) predictions, a fee based on the memory allocated upfront for the model

The computational time increases as a function of the following:

The complexity of the model
The size of the input data
The number of attributes
The number and types of transformations applied

At the time of writing, these charges are as follows:

$0.42 per hour for data analysis and model building fees
$0.10 per 1,000 predictions for batch predictions
$0.0001 per prediction for real-time predictions
$0.001 per hour for each 10 MB of memory provisioned for your model

These prices do not include fees related to the data storage (S3, Redshift, or RDS), which are charged separately.
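The pricing arithmetic can be made concrete with a small Python sketch. It uses only the rates quoted above; the function name `estimate_cost`, its parameters, and the sample workload are hypothetical, and we assume the memory fee is prorated linearly per 10 MB, which may differ from how AWS actually rounds it.

```python
# Rates quoted in the text (at the time of writing); storage fees are excluded.
COMPUTE_RATE_PER_HOUR = 0.42           # data analysis and model building
BATCH_RATE_PER_1000 = 0.10             # batch predictions
REALTIME_RATE_PER_PREDICTION = 0.0001  # real-time predictions
MEMORY_RATE_PER_10MB_HOUR = 0.001      # memory provisioned for real-time use


def estimate_cost(compute_hours, batch_predictions=0, realtime_predictions=0,
                  model_memory_mb=0, endpoint_hours=0):
    """Return an estimated charge in dollars for a hypothetical workload."""
    compute = compute_hours * COMPUTE_RATE_PER_HOUR
    batch = batch_predictions / 1000 * BATCH_RATE_PER_1000
    realtime = realtime_predictions * REALTIME_RATE_PER_PREDICTION
    # Assumption: memory is prorated linearly per 10 MB and per hour.
    memory = model_memory_mb / 10 * MEMORY_RATE_PER_10MB_HOUR * endpoint_hours
    return compute + batch + realtime + memory


# Hypothetical workload: 5 compute hours and 200,000 batch predictions.
print(round(estimate_cost(compute_hours=5, batch_predictions=200_000), 2))
```

With these rates, the batch predictions dominate the bill in this example, which is typical when a model is trained once and then applied to many samples.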

During the creation of your model, Amazon ML gives you a cost estimation based on the data source that has been selected.

The Amazon ML service is not part of the AWS free tier, a 12-month offer applicable to certain AWS services for free under certain conditions.

Understanding predictive analytics

Data Science, predictive analytics, machine learning -- these terms are used in many ways and sometimes overlap each other. What they actually refer to is not always obvious.

Data science is one of the most popular technical domains; interest in it surged after the publication of the often-cited Harvard Business Review article of October 2012, Data Scientist: The Sexiest Job of the 21st Century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). Data science can be seen as an evolution from data mining and data analytics. Data mining is about exploring data to discover patterns that potentially lead to decisions and actions at the business level. Data science encompasses data analytics and brings together a wider scope of domains, such as statistics, data visualization, predictive analytics, software engineering, and so on, under one very large umbrella.

Predictive analytics is the art of predicting future events based on past observations. It requires your data to be organized in a certain way with predictor variables and outcomes well identified. As the Danish politician Karl Kristian Steincke once said, "Making predictions is difficult especially about the future." (This quote has also been attributed to Niels Bohr, Yogi Berra and others by http://quoteinvestigator.com/2013/10/20/no-predict/). Predictive analytics applications are diverse and far ranging: predicting consumer behavior, natural events (weather, earthquakes, and so on), people's behavior or health, financial markets, industrial applications, and so on. Predictive analytics relies on supervised learning, where data and labels are given to train the model.

Machine learning comprises the tools, methods, and concepts for computers to optimize models used for predictive analytics or other goals.

Machine learning's scope is much larger than predictive analytics. Three different types of machine learning are usually considered:

Supervised learning: Assumes that a certain amount of training data with known outcomes is available and can be used to train the model. Predictive analytics is part of supervised learning.
Unsupervised learning: Is about finding patterns in existing data without knowing the outcome. Clustering customer behavior or reducing the dimensions of the dataset for visualization purposes are examples of unsupervised learning.
Reinforcement learning: Is the third type of machine learning, where agents learn to act on their own when given a set of rules and a specific reward schema. Examples of reinforcement learning applications include AlphaGo, Google's world championship Go algorithm, self-driving cars, and semi-autonomous robots. AlphaGo learned from thousands of past games and was able to beat the world Go champion in March 2016 (https://www.wired.com/2016/03/go-grandmaster-lee-sedol-grabs-consolation-win-googles-ai/). A classic reinforcement learning implementation follows this schema, where an agent adapts its actions on an environment based on the resulting rewards:

The difference between supervised and unsupervised learning in the context of binary classification and clustering is illustrated in the following two figures:

For supervised learning, the original dataset is composed of two classes (squares and circles), and we know from the start to which class each sample belongs. Giving that information to a binary classification algorithm allows for a somewhat optimized separation of the two classes. Once that separating frontier is known, the model (the line) can be used to predict the class of new samples depending on which side the sample ends up being:

In unsupervised learning, the different classes are not known. There is no ground truth. The data is given to an algorithm along with some parameters, such as the number of classes to be found, and the algorithm finds the best set of clusters in the original dataset according to a defined criterion or metric. The results may be very dependent on the initialization parameters. There is no truth, no accuracy, just an interpretation of the data. The following figure shows the results obtained by a clustering algorithm asked to find three classes in the original data:

The reader will notice at this point that the book is titled Amazon Machine Learning and not Amazon Predictive Analytics. This is a bit misleading, as machine learning covers many applications and problems besides predictive analytics. However, calling the service machine learning leaves the door open for Amazon to roll out future services that are not focused on predictive analytics. The following figure maps out the relationships between data science terms:

Building the simplest predictive analytics algorithm

Predictive analytics can be very simple. We introduce a very simple example of a predictive model in the context of binary classification based on a simple threshold.

Imagine that a truck transporting small oranges and large grapefruits runs off the road; all the boxes of fruits open up, and all the fruits end up mixed together. Equipped with a simple weighing scale and a way to roll the fruits out of the truck, you want to be able to separate them automatically based on their weights. You have some information on the average weights of small oranges (96g) and large grapefruits (166g).

According to the USDA, the average weight of a medium-sized orange is 131 grams, while a larger orange weighs approximately 184 grams, and a smaller one around 96 grams.

Large grapefruit (approx 4-1/2'' dia) 166g
Medium grapefruit (approx 4'' dia) 128g
Small grapefruit (approx 3-1/2'' dia) 100g

Your predictive model is the following:

You arbitrarily set a threshold of 130g
You weigh each fruit
If the fruit weighs more than 130g, it's a grapefruit; otherwise it's an orange

There! You have a robust, reliable predictive model that can be applied to all your mixed-up fruits to separate them. Note that in this case, you've set the threshold with an educated guess. There was no machine learning involved.
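This hand-set threshold rule fits in a few lines of Python. The function name `classify_fruit` and the sample weights are ours, chosen for illustration.

```python
THRESHOLD_GRAMS = 130  # the hand-picked threshold from the text


def classify_fruit(weight_grams, threshold=THRESHOLD_GRAMS):
    """Label a fruit from its weight alone: heavier than the threshold means grapefruit."""
    return "grapefruit" if weight_grams > threshold else "orange"


# Hypothetical weights, in grams, read off the scale.
for weight in [96, 128, 131, 166]:
    print(weight, classify_fruit(weight))
```

Note that a medium grapefruit of 128g would be labeled as an orange by this rule: the threshold only works well because the truck carried small oranges and large grapefruits.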

In machine learning, the models learn by themselves. Instead of setting the threshold yourself, you let your program evolve and calculate the weight separation threshold of fruits by itself.

For that, you would set aside a certain number of oranges and grapefruits. This is called the training dataset. It's important that this training dataset has roughly the same number of oranges and grapefruits.

And you let the machine decide the threshold value by itself. A possible algorithm could be along these lines:

Set original weight estimation at w_0 = 100g to initialize and a counter k = 0
For each new fruit in the training dataset, adjust the weight estimation according to the following:

        For each new fruit_weight:
            w(k+1) = (k*w(k) + fruit_weight)/ (k+1)
            k = k+1

Assuming that your training dataset is representative of all the remaining fruits and that you have enough fruits, the threshold would converge under certain conditions to the average of all the fruit weights. You then use this value to separate all the other fruits depending on whether they weigh more or less than the threshold you estimated. The following plot shows the convergence of this crude algorithm to estimate the average weight of the fruits:
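Separately from the plot, the update rule can be written as a short Python sketch. The training weights below are made up for illustration, and the function name `learn_threshold` is ours. Because the counter starts at k = 0, the initial estimate of 100g is overwritten by the first fruit, so the procedure is simply a running average.

```python
def learn_threshold(training_weights, initial_estimate=100.0):
    """Apply w(k+1) = (k*w(k) + fruit_weight) / (k+1) over the training set.

    Returns the list of successive estimates so convergence can be inspected.
    """
    estimate = initial_estimate
    history = []
    for k, fruit_weight in enumerate(training_weights):
        estimate = (k * estimate + fruit_weight) / (k + 1)
        history.append(estimate)
    return history


# Made-up training set: as many small oranges as large grapefruits.
training_weights = [96, 166, 100, 160, 92, 172, 98, 164]
history = learn_threshold(training_weights)
print(history)
print("learned threshold:", history[-1])
```

The balance between the two classes matters here: with many more oranges than grapefruits, the running average would drift toward the orange weights and make a poor separating threshold.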

This problem is a typical binary classification problem. If we had not two but three types of fruits (lemons, oranges, and grapefruit), we would have a multiclass classification problem.

In this example, we only have one predictor: the weight of the fruit. We could add another predictor such as the diameter. This would result in what is called a multivariate classification problem.

In practice, machine learning uses more complex algorithms such as the SGD, the linear algorithm used by Amazon ML. Other classic prediction algorithms include Support Vector Machines, Bayes classifiers, Random forests and so on. Each algorithm has its strength and set of assumptions on the dataset.

Regression versus classification

Amazon ML does two types of predictive analytics: classification and regression.

As discussed in the preceding paragraph, classification is about predicting a finite set of labels or categories for a given set of samples.

In the case of two classes, the problem is called binary classification
When there are more than two classes and the classes are mutually exclusive, the problem is a multiclass classification problem
If the samples can belong to several classes at once, we talk about a multilabel classification problem

In short, classification is the prediction of a finite set of classes, labels, categories.

Examples of binary classification are: buying outcome (yes/no), survival outcome (yes/no), anomaly detection (spam, bots), and so on
Examples of multiclass classification are: classifying objects in images (fruits, cars, and so on), identifying a music genre or a movement based on smartphone sensors, document classification, and so on

In regression problems, the outcome has continuous values. Predicting age, weight, stock prices, salaries, rainfall, temperature, and so forth are all regression problems. We talk about multiple regression when there are several predictors and multivariate regression when the model predicts several values for each sample. Amazon ML does univariate regression and classification, both binary and multiclass, but not multilabel.

Expanding regression to classification with logistic regression

Amazon ML uses a linear regression model for regression, binary, and multiclass predictions. Using the logistic regression model extends continuous regression to classification problems.

A simple regression model with one predictor is modeled as follows:

y = ax + b

Here, x is the predictor, y is the outcome, and (a, b) are the model parameters. Each predicted value y is continuous and not bounded. How can we use that model to predict classes which are by definition categorical values?

Take the example of binary predictions. The method is to transform the continuous predictions that are not bounded into probabilities, which are all between 0 and 1. We then associate these probabilities to one of the two classes using a predefined threshold. This model is called the logistic regression model, a misleading name since logistic regression is a classification model and not a regression one.

To transform continuous unbounded values into probabilities, we use the sigmoid function defined as follows:

f(x) = 1 / (1 + e^(-x))

This function transforms any real number into a value within the [0,1] interval. Its output can, therefore, be interpreted as a probability:

In conclusion, the way to do binary classification with a regression model is as follows:

Build the regression model, and estimate the real valued outcomes y.
Use the predicted value y as the argument of the sigmoid function. The result f(y) is a probability measure of belonging to one of the two classes.
Set a threshold T in [0,1]. All predicted samples with a probability f(y) > T belong to one class; the others belong to the other class. The default value of T is 0.5.
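The three steps above can be sketched in Python for a model with a single predictor. The coefficients `a` and `b` are hypothetical values picked so that the linear output is zero at x = 130; in practice they would be estimated from training data.

```python
import math


def sigmoid(y):
    """Map a real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))


def predict_class(x, a, b, threshold=0.5):
    """Binary prediction from a one-predictor linear model y = a*x + b."""
    y = a * x + b                 # step 1: real-valued outcome
    probability = sigmoid(y)      # step 2: turn it into a probability
    return 1 if probability > threshold else 0   # step 3: apply threshold T


# Hypothetical coefficients, chosen so that y = 0 when x = 130.
a, b = 0.1, -13.0
print(predict_class(166, a, b), predict_class(96, a, b))
```

Raising or lowering the threshold T trades one type of error for the other, which is why it is exposed as a parameter here rather than hard-coded.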

Logistic regression is, by nature, a binary classifier. There are several strategies to transform a binary classifier into a multiclass classifier.

The one versus all (OvA) technique consists in selecting one class as positive and all the others as negative to go back to a binary classification problem. Once the classification on the first class is carried out, a second class is selected as the positive versus all the others as negative. This process is repeated N-1 times when there are N classes to predict. The following set of plots shows:

The original datasets and the classes for all the samples
The result of the first Binary classification (circles versus all the others)
The result of the second classification that separates the squares and the triangles
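As a rough illustration of the one-versus-all idea, here is a Python sketch of a common variant in which one binary model is kept per class and the most confident one wins. This differs from the sequential procedure described above, and the weights in `BINARY_MODELS` are hypothetical values set by hand rather than learned.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


# One hypothetical binary model per class: (weights, bias) for score = w.x + b.
# Each model answers "this class versus all the others".
BINARY_MODELS = {
    "circle":   ((-1.0, -1.0), 5.0),
    "square":   ((1.0, 0.0), -5.0),
    "triangle": ((0.0, 1.0), -5.0),
}


def predict_one_vs_all(point, models=BINARY_MODELS):
    """Return the class whose binary model is most confident, plus all probabilities."""
    probabilities = {}
    for label, (weights, bias) in models.items():
        score = sum(w * x for w, x in zip(weights, point)) + bias
        probabilities[label] = sigmoid(score)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


for point in [(1.0, 1.0), (9.0, 1.0), (1.0, 9.0)]:
    print(point, predict_one_vs_all(point)[0])
```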

Extracting features to predict outcomes

The available data needs to be accessible and meaningful in order for the algorithm to extract information.

Let's consider a simple example. Imagine that we want to predict the market price of a house in a given city. We can think of many variables that would be predictors of the price of a house: the number of rooms or bathrooms, the neighborhood, the surface, the heating system, and so on. These variables are called features, attributes, or predictors. The value that we want to predict is called the outcome or the target.

If we want our predictions to be reliable, we need several features. Predicting the price of a house based on its surface alone would not be very efficient. Many other factors influence the price of a house and our dataset should include as many as possible (with conditions).

It's often possible to add large numbers of attributes to a model to try to improve the predictions. For instance, in our house pricing prediction, we could add all the characteristics of the house (bathrooms, surface area, heating system, the number of windows). Some of these variables would bring more information to our pricing model and increase the accuracy of our predictions, while others would just add noise and confuse the algorithm. Adding new variables to a predictive model does not always improve the predictions.

In order to make reliable predictions, each of the new features you bring to your model must bring some valuable piece of information. However, this is also not always the case. As we will see in Chapter 2, Machine Learning Definitions and Concepts, correlated predictors can hurt the performances of the model.

Predictive analytics is built on several assumptions and conditions:

The value you are trying to predict is predictable and not just some random noise.
You have access to data that has some degree of association to the target.
The available dataset is large enough. Reliable predictions cannot be inferred from a dataset that is too small. (For instance, you can define and therefore predict a line with two points but you cannot infer data that follows a sine curve from only two points.)
The new data you will base future predictions on is similar to the data you used to parameterize and train your model.

You may have a great dataset, but that does not mean it will be efficient for predictions.

These conditions on the data are very general. In the case of SGD, the conditions are more constrained.

Diving further into linear modeling for prediction

Amazon ML is based on linear modeling. Recall the equation for a straight line in the plane:

y = ax + b

This linear equation with coefficients (a, b) can be interpreted as a predictive linear model with x as the predictor and y as the outcome. In this simple case, we have two parameters (a, b) and one predictor x. An example is predicting the height of children from their weight by finding some a and b such that the following equation holds:

height = a * weight + b

Let's consider the classic Lewis Taylor (1967) dataset with 237 samples of children's age, weight, height, and gender (https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm) and focus on the relation between the height and weight of the children. In this dataset, the optimal regression line follows the following equation:

The following figure illustrates the height versus weight dataset and the associated linear regression:
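Independently of the figure, here is a minimal Python sketch of how such a regression line can be computed for a single predictor, using the standard least-squares formulas (slope equal to the covariance divided by the variance of the predictor). The sample values are made up for illustration and are not taken from the Lewis Taylor dataset; the function name `fit_line` is ours.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b


# Made-up (weight, height) pairs for illustration only.
weights = [50.0, 60.0, 70.0, 80.0, 90.0]
heights = [52.0, 55.0, 57.0, 60.0, 64.0]
a, b = fit_line(weights, heights)
print(f"height = {a:.3f} * weight + {b:.3f}")
```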

Consider now that we have not one predictor but several, and let's generalize the preceding linear equation to n predictors denoted by {x₁, . . . , x_n} and n + 1 coefficients or weights {w_0, w₁, . . ., w_n}. The linear model can be written as follows:

ŷ = w_0 + w₁x₁ + . . . + w_n x_n

Here, ŷ denotes the predicted value (y would correspond to the true value to be predicted). To simplify notations, we will assume for the rest of the book that the coefficient w_0 = 0.

This equation can be rewritten in vector form as follows:

ŷ = W^T X

Where T is the transpose operator, X = {x₁, . . ., x_n} and W = {w₁, . . ., w_n} are the respective vectors of predictors and model weights. Under certain conditions, the coefficients w_i can be calculated exactly. However, for large datasets, these calculations are expensive, as they involve building and inverting matrices computed from the entire dataset, which is costly and slow. As the number of samples grows, it becomes more efficient to estimate these model coefficients via an iterative process.
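In code, the vector form of the prediction is simply a dot product between the weights and the predictors. The following Python sketch shows it with hypothetical weights for a house price model; the numbers are illustrative only and are not learned from any data.

```python
def predict(weights, predictors):
    """Linear prediction as the dot product of the weight and predictor vectors."""
    if len(weights) != len(predictors):
        raise ValueError("weights and predictors must have the same length")
    return sum(w * x for w, x in zip(weights, predictors))


# Hypothetical weights for three predictors of a house price
# (rooms, bathrooms, surface); the values are made up.
W = [10_000.0, 5_000.0, 1_500.0]
X = [4.0, 2.0, 120.0]
print(predict(W, X))
```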

The Stochastic Gradient Descent algorithm iteratively estimates the coefficients {w_0, w₁, . . ., w_n} of the model. At each iteration, it uses a random sample of the training dataset for which the real outcome value is known. The SGD algorithm works by minimizing a function of the prediction error:

Functions that take the prediction error as argument are also called loss functions. Different loss functions result in different algorithms. A convex loss function has a unique minimum, which corresponds to the optimal set of weights for the regression problem. We will come back to the SGD algorithm in detail in later chapters. Suffice it to say for now that the SGD algorithm is especially well-suited to deal with large datasets.
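To give a feel for the iterative process, here is a bare-bones Python sketch of stochastic gradient descent on the squared error for a one-predictor model. It is a teaching example under our own assumptions, not the implementation used by Amazon ML: the learning rate, the number of epochs, and the synthetic data are arbitrary choices, and unlike the simplified notation above it keeps an intercept term `b`.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                  # visit the samples in random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]   # prediction error on one sample
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * xs[i]
            b -= learning_rate * error
    return w, b


# Noise-free synthetic data generated from y = 2x + 1.
xs = [i / 10 for i in range(11)]
ys = [2 * x + 1 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

Each update only looks at one sample, which is what keeps the cost per iteration low and makes the approach attractive for large datasets.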

There are many reasons to justify selecting the SGD algorithm for general purpose predictive analysis problems:

It is robust
Its convergence properties have been extensively studied and are well known
It is well adapted to optimization techniques
It has many extensions and variants
It has low computational cost
It can be applied to regression, classification, and streaming data

Some weaknesses include the following:

The need to properly initialize its parameters
A convergence rate dependent on a parameter called the learning rate

Validating the dataset

Not all datasets lend themselves to linear modeling. There are several conditions that the samples must verify for your linear model to make sense. Some conditions are strict, others can be relaxed.

In general, linear modeling assumes the following conditions (http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/):

Normalization/standardization: Linear regression can be sensitive to predictors that exhibit very different scales. This is true for all loss functions that rely on a measure of the distance between samples or on the standard deviations of samples. Predictors with higher means and standard deviations have more impact on the model and may potentially overshadow predictors with better predictive power but more constrained range of values. Standardization of predictors puts all the predictors on the same level.
Independent and identically distributed (i.i.d.): The samples are assumed to be independent from each other and to follow a similar distribution. This property is often assumed even when the samples are not that independent from each other. In the case of time series where samples depend on previous values, using the sample to sample difference as the data is often enough to satisfy the independence assumption. As we will see in Chapter 2, Machine Learning Definitions and Concepts, confounders and noise will also negatively impact linear regression.
No multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data, meaning that no predictor is a linear combination of other predictors. Predictors that can be approximated by linear combinations of other predictors will confuse the model.
No heteroskedasticity (homoskedasticity): The variance of the residuals is constant across the whole range of values of the predictors.
Gaussian distribution of the residuals: This is more of an a posteriori validation that the linear regression is valid. The residuals are the differences between the true values and their linear estimation. The linear regression is considered relevant if these residuals follow a Gaussian distribution.
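
Two of the preparations mentioned in this list, standardizing predictors and differencing a time series, can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the helper names and the sample values (income, age, trend) are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)       # population standard deviation
    return [(v - mean) / std for v in values]

def first_difference(series):
    """Replace a time series by its sample-to-sample differences."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Two hypothetical predictors on very different scales
income = [32000, 45000, 51000, 78000, 120000]
age = [23, 31, 38, 45, 62]

income_std = standardize(income)
age_std = standardize(age)
print([round(v, 2) for v in income_std])
print([round(v, 2) for v in age_std])

# A hypothetical trending series and its differences
trend = [100, 103, 107, 112, 118]
print(first_difference(trend))
```

After standardization both predictors have a mean of 0 and a standard deviation of 1, so neither dominates the other simply because of its units.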

These assumptions are rarely perfectly met in real-life datasets. As we will see in Chapter 2, Machine Learning Definitions and Concepts, there are techniques to detect when the linear modeling assumptions are not respected, and subsequently to transform the data to get closer to the ideal linear regression context.
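
As one illustration of such a check, the sketch below flags pairs of predictors whose Pearson correlation is very high, which is a simple warning sign of multicollinearity. It only detects pairwise relationships, not a predictor that depends on several others at once. The function names, the 0.9 threshold, and the data are assumptions made for this example.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlated_pairs(predictors, threshold=0.9):
    """Return the pairs of predictor names whose |correlation| >= threshold."""
    names = sorted(predictors)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(predictors[first], predictors[second])) >= threshold:
                flagged.append((first, second))
    return flagged

# Hypothetical predictors: the same height recorded in two different units
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 44, 39, 43],
}
print(correlated_pairs(predictors))
```

Here the two height columns carry the same information, so one of them would typically be dropped before training.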

Missing from Amazon ML

Amazon ML offers supervised learning predictions for classification (binary and multiclass) and regression problems. It offers some very basic visualization of the original data and has a preset list of data transformations, such as binning or normalizing the data. It is efficient and simple. However, several functionalities that are important to the data scientist are unfortunately missing from the platform. Lacking these features may not be a deal breaker, but it nonetheless restricts the scope of problems Amazon ML can be applied to.

Some of the common machine learning features Amazon ML does not offer are as follows:

Unsupervised learning: It is not possible to do clustering or dimensionality reduction of your data.
A choice of models besides linear models: Non-linear Support Vector Machines, any type of Bayes classification, neural networks, and tree-based algorithms (decision trees, random forests, or boosted trees) are all absent. All predictions and all experiments will be built on linear regression and logistic regression with SGD.
Data visualization capabilities are reduced to histograms and density plots.
A choice of metrics: Amazon ML uses F1-score and ROC-AUC metrics for classification, and RMSE for regression. It is not possible to assess the model performance with any other metric.
You cannot download your trained model and use it anywhere other than Amazon ML.
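
For readers unfamiliar with the metrics mentioned in this list, here is a minimal sketch of how the F1-score (for binary classification) and the squared-error metrics (for regression) are computed. It does not call Amazon ML; the function names and the sample labels and values are hypothetical.

```python
import math

def f1_score(y_true, y_pred):
    """F1-score for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                        # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_squared_error(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical binary classification results
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
f1 = f1_score(labels, predicted)
print(f1)

# Hypothetical regression results
targets = [3.0, 5.0, 2.0]
estimates = [2.5, 5.0, 4.0]
mse = mean_squared_error(targets, estimates)
rmse = math.sqrt(mse)                     # root mean squared error
print(round(mse, 4), round(rmse, 4))
```

The root mean squared error is simply the square root of the MSE, which expresses the error in the same units as the target.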

Finally, although it is not possible to directly use your own scripts (R, Python, Scala, and so on) within the Amazon ML platform, it is possible and recommended to use other AWS services, such as AWS Lambda, to preprocess the datasets. Data manipulation beyond the transformations available in Amazon ML can also be carried out with SQL if your data is stored in one of the SQL-enabled AWS services (Athena, RDS, Redshift, and others).

The statistical approach versus the machine learning approach

In 2001, Leo Breiman published a paper titled Statistical Modeling: The Two Cultures (http://projecteuclid.org/euclid.ss/1009213726) that underlined the differences between the statistical approach, focused on validation and explanation of the underlying process in the data, and the machine learning approach, which is more concerned with the results.

Roughly put, a classic statistical analysis follows steps such as the following:

A hypothesis called the null hypothesis is stated. This null hypothesis usually states that the observation is due to randomness.
The probability (or p-value) of the event under the null hypothesis is then calculated.
If that probability is below a certain threshold (usually 0.05), then the null hypothesis is rejected, which means that the observation is unlikely to be a random fluke.

p > 0.05 does not imply that the null hypothesis is true. It only means that you cannot reject it, as the probability of the observation happening by chance is not small enough.
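
The steps above can be illustrated with a small coin-flipping example. The null hypothesis is that the coin is fair, and the p-value is the probability of seeing at least the observed number of heads under that hypothesis. This is a minimal sketch; the function name binomial_p_value and the observed counts are hypothetical.

```python
from math import comb

def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: probability of at least `heads` heads under the null."""
    return sum(
        comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )

THRESHOLD = 0.05

# Hypothetical experiments: 100 flips each
for observed_heads in (61, 55):
    p = binomial_p_value(observed_heads, 100)
    decision = "reject" if p < THRESHOLD else "cannot reject"
    print(f"{observed_heads} heads: p = {p:.4f}, {decision} the null hypothesis")
```

With 61 heads the p-value falls below the threshold and the fair-coin hypothesis is rejected. With 55 heads it does not, which, as noted above, does not prove that the coin is fair.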

This methodology is geared toward explaining and discovering the influencing factors of the phenomenon. The goal here is to establish/build a somewhat static and fully known model that will fit observations as well as possible and, therefore, will be able to predict future patterns, behaviors, and observations.

In the machine learning approach to predictive analytics, an explicit representation of the model is not the focus. The goal is to build the best model for the prediction period, and the model builds itself from the observations. The internals of the model are not explicit. This is why the machine learning approach is often described as a black box model.

By removing the need for explicit modeling of the data, the ML approach has a stronger potential for predictions. ML is focused on making the most accurate predictions possible by minimizing the prediction error of a model at the expense of explainability.