How automated ML works
ML techniques work great when it comes to finding patterns in large datasets. Today, we use these techniques for anomaly detection, customer segmentation, customer churn analysis, demand forecasting, predictive maintenance, and pricing optimization, among hundreds of other use cases.
A typical ML life cycle comprises data collection, data wrangling, pipeline management, model retraining, and model deployment, of which data wrangling is typically the most time-consuming task.
Extracting meaningful features from data, and then using them to build a model while finding the right algorithm and tuning its parameters, is also a very time-consuming process. Can we automate this process using the very thing we are trying to build (meta enough?); that is, should we automate ML? Well, that is how this all started – with someone attempting to print a 3D printer using a 3D printer.
A typical data science workflow starts with a business problem (hopefully!) and is used either to prove a hypothesis or to discover new patterns in the existing data. It requires data, and cleaning and preprocessing that data takes an awfully large amount of time – as much as 80% of your total time. This "data munging" or wrangling includes cleaning, de-duplication, outlier analysis and removal, transforming, mapping, structuring, and enriching. Essentially, we are taming unwieldy, vivacious, raw real-world data and putting it into the desired format for analysis and modeling so that we can gain meaningful insights from it.
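To make these wrangling steps concrete, here is a minimal sketch using pandas. The file name, column names, and the three-standard-deviation threshold are hypothetical, chosen purely for illustration:

import pandas as pd

# Load a hypothetical customer dataset
df = pd.read_csv("customers.csv")

# Cleaning and de-duplication
df = df.drop_duplicates()
df = df.dropna(subset=["age", "income"])  # drop rows missing key fields

# Outlier analysis and removal: keep incomes within 3 standard deviations
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() <= 3]

# Transforming and enriching: derive a new feature from existing columns
df["income_per_year_of_age"] = df["income"] / df["age"]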
Next, we must select and engineer features, which means figuring out which features are useful, and then brainstorming and working with subject matter experts (SMEs) on the importance and validity of those features. Validating how these features will work with your model, assessing their fitness from both a technical and a business perspective, and improving them as needed are also critical parts of the feature engineering process. The feedback loop to the SME is often very important, albeit the least emphasized part of the feature engineering pipeline. The transparency of models stems from clear features – if features such as race or gender give your loan repayment propensity model higher accuracy, that does not mean it's a good idea to use them. In fact, an SME would tell you – if your conscious mind hasn't – that it's a terrible idea and that you should look for more meaningful and less sexist, racist, and xenophobic features. We will discuss this further in Chapter 10, AutoML in the Enterprise, when we discuss operationalization.
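Statistical feature scoring can kick off that SME conversation. The following sketch uses scikit-learn's SelectKBest on synthetic data; the dataset and the choice of k = 5 are illustrative assumptions, and the resulting scores are a starting point for the discussion with your SMEs, not a substitute for it:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Score each feature against the target and keep the top five
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the top-scoring features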
Even though the task of "selecting a model family" sounds like a reality show, it is what data scientists and ML engineers do as part of their day-to-day job. Model selection is the task of picking the model that best describes the data at hand; this involves choosing an ML model from a set of candidate models. Automated ML can give you a helping hand with this, as the sketch that follows suggests.
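One simple way to read "model selection" is as a cross-validated bake-off between candidate model families. The following scikit-learn sketch illustrates the idea on synthetic data; it is not what any particular automated ML product does internally:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A set of candidate models from different families
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}

# Score each candidate with 5-fold cross-validation and keep the best
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])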
Hyperparameters
You will hear about hyperparameters a lot, so let's make sure you understand what they are.
Each model has its own internal and external parameters. Internal parameters (also known as model parameters, or simply parameters) are intrinsic to the model, such as its weights and coefficients, while external parameters, or hyperparameters, sit "outside" the model itself, such as the learning rate and the number of iterations. An intuitive example can be derived from k-means, a well-understood unsupervised clustering algorithm known for its simplicity.
The k in k-means stands for the number of clusters required, and epochs (pronounced epics, as in Doctor Who is an epic show!) specify the number of passes made over the training data. Both are examples of hyperparameters – that is, parameters that are not intrinsic to the model itself. Similarly, the learning rate for training a neural network, C and sigma for support vector machines, the number of leaves or the depth of a tree, the latent factors in a matrix factorization, and the number of hidden layers in a deep neural network are all examples of hyperparameters.
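scikit-learn's k-means implementation makes this distinction concrete: n_clusters (the k) and max_iter are hyperparameters you set up front, while the cluster centers are internal parameters the model learns from the data. The synthetic blobs here are purely illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_clusters and max_iter are hyperparameters – set before training
km = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=0)
km.fit(X)

# The cluster centers are internal (learned) model parameters
print(km.cluster_centers_)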
Selecting the right hyperparameters has been called tuning your instrument – this is where the magic happens. In ML tribal folklore, these elusive numbers have been branded "nuisance parameters", to the point where proverbial statements such as "tuning is more of an art than a science" and "tuning models is like black magic" tend to discourage newcomers to the industry. Automated ML is here to change this perception by helping you choose the right hyperparameters; more on this later. Automated ML enables citizen data scientists to build, train, and deploy ML models, thus possibly disrupting the status quo.
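Here is a minimal sketch of taking the "black magic" out of tuning: an exhaustive grid search over C and gamma (the sigma-like kernel width) for a support vector machine. The grid values and the synthetic data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate hyperparameter values to try
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}

# Cross-validated search over every combination in the grid
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)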
Important note
Some consider the term "citizen data scientists" a euphemism for non-experts, but SMEs and people who are curious about analytics are some of the most important people in this field – and don't let anyone tell you otherwise.
In conclusion, from building the correct ensembles of models to preprocessing the data, selecting the right features and model family, choosing and optimizing model hyperparameters, and evaluating the results, automated ML offers algorithmic solutions that can programmatically address these challenges.
The need for automated ML
At the time of writing, OpenAI's GPT-3 model has recently been announced, with an incredible 175 billion parameters. Due to this ever-increasing model complexity, along with big data and an exponentially growing number of features, we now need not only the ability to tune these parameters, but also sophisticated, repeatable procedures for tweaking these proverbial knobs. This complexity makes ML less accessible to citizen data scientists, business subject matter experts, and domain experts – which might sound like job security, but it is not good for business, nor for the long-term success of the field.
Also, this isn't just about the hyperparameters: the entire pipeline and the reproducibility of the results become harder to manage as model complexity grows, which curtails the democratization of AI.