Data scientist was consistently ranked the best job in America by Glassdoor from 2016 to 2019, yet the best job in America has not produced the best results for the companies employing these professionals. According to VentureBeat, 87% of data science projects fail to make it into production. This means that most of the work data scientists perform does not impact their employer in any meaningful way.
By itself, this is not a problem. If data scientists were cheap and plentiful, companies would still see a return on their investment. However, this is simply not the case. According to 2020 LinkedIn salary statistics, data scientists in the United States earn an average total compensation of around $111,000 across all career levels. It's also very easy for them to find new jobs.
Burtch Works, a United States-based executive recruiting firm, reports that, as of 2018, data scientists stayed at their job for only 2.6 years on average, and 17.6% of all data scientists changed jobs that year. Data scientists are expensive and hard to keep.
Likewise, even though 87% of their projects fail to have an impact, a return on investment (ROI) would still be possible if data scientists worked fast. Failing fast means that many projects still make it into production and the department is successful. Failing slow means that the department fails to deliver.
Unfortunately, most data science departments fail slow. To understand why, you must first understand what machine learning is, how it differs from traditional software development, and the five steps common to all machine learning projects.
Defining machine learning, data science, and AI
Machine learning is the process of training statistical models to make predictions using data. It is a category within artificial intelligence (AI), which is defined as computer programs that perform cognitive tasks, such as decision making, that would normally be performed by a human. Data science is a career field that combines computer science, machine learning, and other statistical techniques to solve business problems.
Data scientists use a variety of machine learning algorithms to solve business problems. Machine learning algorithms are best thought of as a defined set of mathematical computations to perform on data to make predictions. Common applications of machine learning that you may experience in everyday life include detecting when your credit card is used to make a fraudulent transaction, determining how much money you should be given when applying for a loan, and figuring out which items are suggested to you when shopping online. All of these decisions, big and small, are determined mechanistically through machine learning.
There are many types of algorithms, but it's not important for you to know them all. Random Forest, XGBoost, LightGBM, deep learning, CART decision trees, multiple linear regression, naïve Bayes, logistic regression, and k-nearest neighbors are all examples of machine learning algorithms. These algorithms are powerful because they learn patterns in data that would be too complex or subtle for any human being to detect on their own.
What is important for you to know is the difference between supervised learning and unsupervised learning. Supervised learning uses historical, labeled data to make future predictions.
Imagine you are a restaurant manager and you want to forecast how much money you will make next month by running an advertising campaign. To accomplish this with machine learning, you would want to collect all of your sales data from previous years, including the results of previous campaigns. Since you have past results and are using them to make predictions, this is an example of supervised learning.
Unsupervised learning simply groups like data points together. It's useful when you have a lot of information about your customers and would like to group them into buckets so that you can advertise to them in a more targeted fashion. Azure AutoML, however, is strictly for supervised learning tasks. Thus, you always need to have past results available in your data when creating new AutoML models.
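To make the distinction concrete, here is a minimal sketch in Python using scikit-learn; the library choice and all of the numbers are illustrative assumptions, not part of the examples above. The supervised model is fit on inputs and known outcomes, while the unsupervised model is fit on inputs alone:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised learning: inputs X AND known outcomes y (labeled historical data).
X = np.array([[1000.0], [2000.0], [3000.0], [4000.0]])  # ad spend per past campaign
y = np.array([5000.0, 9000.0, 12000.0, 16000.0])        # revenue each campaign produced
model = LinearRegression().fit(X, y)                    # learns the X -> y relationship
print(model.predict(np.array([[2500.0]])))              # forecast for a planned campaign

# Unsupervised learning: inputs only, no labels; the algorithm groups similar rows.
customers = np.array([[25, 40], [27, 45], [60, 200], [62, 210]])  # age, monthly spend
segments = KMeans(n_clusters=2, n_init=10).fit_predict(customers)
print(segments)  # e.g., [0 0 1 1]: two customer segments to target differently
```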
Machine learning versus traditional software
Traditional software development and machine learning development differ tremendously. Programmers are used to creating software that takes in input and delivers output based on explicitly defined rules. Data scientists, on the other hand, collect the desired output first before making a program. They then use this output data along with input data to create a program that learns how to predict output from input.
For example, say you would like to build an algorithm that predicts how many car accidents will occur in a given city on a given day. You would begin by collecting historical data, such as the number of car crashes (the desired output) and any data you guess would be useful in predicting that number (the input data). Weather data, the day of the week, the amount of traffic, and data related to city events can all be used as input.
Once you collect the data, your next step is to create a statistical program that finds hidden patterns between the input and output data; this is called model training. After you train your model, your next step is to set up an inference program that uses new input data to predict how many car accidents will happen that day using your trained model.
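Here is a minimal sketch of that contrast, again using scikit-learn; the rule function, columns, and numbers are invented purely for illustration:

```python
from sklearn.ensemble import RandomForestRegressor

# Traditional software: a human writes the rules explicitly.
def crashes_by_rule(rainfall_mm: float, traffic_volume: int) -> int:
    return 5 + (2 if rainfall_mm > 10 else 0) + traffic_volume // 10000

# Machine learning: collect historical inputs AND outputs, then learn the rules.
X_history = [[0, 40000], [12, 55000], [3, 30000], [20, 60000]]  # rainfall, traffic
y_history = [9, 14, 7, 18]                                      # crashes observed
model = RandomForestRegressor(n_estimators=100).fit(X_history, y_history)  # training

# Inference: feed today's input into the trained model to predict today's output.
print(model.predict([[8, 45000]]))
```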
Another major difference is that, with machine learning, you never know which data you will need to create a solution until you try it out, and you never know how well the solution will perform until you build it. Since data scientists never know in advance what data they will need to solve a given problem, they must ask business experts for advice and use their intuition to identify the right data to collect.
These differences are important because successful machine learning projects look very different from successful traditional software projects; confusing the two leads to failed projects. Managers with an IT background but lacking a data science background often try to follow methods and timelines inappropriate for a machine learning project.
Frankly, it's unrealistic to assign hard timelines to a process where you don't know what data you will need or what algorithms will work, and many data science projects fail simply because they weren't given adequate time or support. There is, however, a recipe for success.
The five steps to machine learning success
Now that we know what machine learning is and how it differs from traditional software development, the next step is to learn how a typical machine learning project is structured. There are many ways you could divide the process, but there are roughly five parts, as shown in the following diagram:
Figure 1.1 – The five steps of any machine learning project
Let's look at each of these steps in turn.
Understanding the business problem
Step 1, understanding the business problem, means talking to end users about what problems they are trying to solve and translating that into a machine learning problem.
For example, a problem in the world of professional basketball may be, "We are really bad at drafting European basketball players. We would like to get better at selecting the right players for our team." You will need to figure out what the business means by a good player. Along the way, you may discover that most players brought over from Europe play only a few games before being sent home, and that this costs the team millions of wasted dollars.
Armed with this information, you then need to translate the problem into one solvable by machine learning. Stated clearly, "We will use a player's historical in-game statistics and demographic information to predict the longevity of their NBA career" would make a good machine learning project. Translating a business problem into an AI problem always means using data to predict either a number (the number of games played in the NBA) or a category (whether the player will be sent home after a handful of games).
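In code, this translation often amounts to nothing more than choosing which column to predict. A hypothetical pandas sketch, where every column name and number is invented for illustration:

```python
import pandas as pd

# Hypothetical historical data on European players drafted into the NBA.
players = pd.DataFrame({
    "points_per_game_europe": [18.2, 7.4, 12.1],
    "age_when_drafted": [19, 22, 20],
    "nba_games_played": [450, 12, 230],
})

X = players[["points_per_game_europe", "age_when_drafted"]]  # inputs

# Predicting a number (regression): how many NBA games will the player last?
y_number = players["nba_games_played"]

# Predicting a category (classification): will the player be sent home early?
y_category = (players["nba_games_played"] < 50).astype(int)  # 1 = sent home early
```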
Collecting and cleansing data
Step 2, collecting and cleansing data, involves the following steps:
- Identifying and gaining access to data sources
- Retrieving all of the data you want
- Joining all of your data together
- Removing errors in the data
- Applying business logic to create a clean dataset even a layman could understand
This is harder than it sounds. Data is often dirty and hard to find.
In our basketball case, this would mean scraping publicly available data from the web to get each player's in-game statistics and demographic information. Errors are nearly guaranteed, so you will have to build in logic to remove or fix nonsensical numbers. No human being is 190 inches tall, for example, but centimeters and inches are often confused.
The best test for whether you have properly cleansed a dataset and made it clear is to give it to a layman and ask simple questions: "How tall is player Y? How many NBA games did player X participate in during his career?" If they can answer, you have succeeded.
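A minimal pandas sketch of this kind of cleansing logic, with invented players and a made-up unit-confusion rule:

```python
import pandas as pd

players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "height": [190.0, 75.0, 82.0],  # mixed units: centimeters and inches
})

# Business rule: no professional player is under 120 cm tall, so any value below
# that threshold was almost certainly recorded in inches. Convert those to cm.
in_inches = players["height"] < 120
players.loc[in_inches, "height"] = (players.loc[in_inches, "height"] * 2.54).round()

print(players)  # every height is now in centimeters and passes the layman test
```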
Transforming data for machine learning
Once you have an easily understandable, cleansed dataset, the next step is transforming data for machine learning, which is called feature engineering. Feature engineering is the process of altering data for machine learning algorithms. Some features are necessary for the algorithm to work, while other features make it easier for the algorithm to find patterns. Common feature engineering techniques include one-hot encoding categorical variables, scaling numeric values, removing outliers, and filling in null values.
A complication is that different algorithms require different types of feature engineering. Unlike most algorithms, XGBoost does not require you to fill in null values. Decision trees aren't much affected by outliers, but outliers throw off regression models. Going back to our basketball problem, you would likely have to replace null values, scale numeric values so that each column falls within the range 0 to 1, and one-hot encode categorical variables.
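Here is a minimal scikit-learn sketch of the null-filling and 0-to-1 scaling steps, using invented basketball numbers (one-hot encoding is illustrated after the tip below):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Invented numeric columns: points per game and minutes played, with one null value.
X = np.array([[18.2, 34.0],
              [np.nan, 12.5],
              [12.1, 28.0]])

X = SimpleImputer(strategy="median").fit_transform(X)  # replace nulls with the median
X = MinMaxScaler().fit_transform(X)                    # rescale each column to 0-1
print(X)
```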
Important tip
One-hot encoding categorical variables simply means taking one column with many categories and turning it into many columns, each containing either a one or a zero. For example, if you have one column with the values USA, Canada, or Mexico, one-hot encoding that column would create three columns, one for each country. A row with the value USA would have a 1 in the USA column and a 0 in the Canada and Mexico columns.
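In pandas, the tip above maps to a single function call; the country column here is assumed purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "Canada", "Mexico", "USA"]})
encoded = pd.get_dummies(df, columns=["country"], dtype=int)
print(encoded)
#    country_Canada  country_Mexico  country_USA
# 0               0               0            1   <- a USA row
# 1               1               0            0   <- a Canada row
# 2               0               1            0   <- a Mexico row
# 3               0               0            1
```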
Training the machine learning model
Now that you have your data in just the right format, it's time to train a machine learning model. Although this step gets a lot of glamour and hype, training a machine learning model is a process that is both quick and slow. With today's technology, most machine learning models can be trained with only a few lines of code.
Hyperparameter tuning, in contrast, can take a very long time. Each machine learning algorithm has settings you can control, called hyperparameters. Hyperparameter tuning means retraining a machine learning algorithm multiple times until you find the right combination of these settings.
Some algorithms, such as Random Forest, do not benefit much from hyperparameter tuning. Others, such as XGBoost or LightGBM, often improve drastically. Depending on the size of your data, the algorithm you're using, and the amount of compute you have available, hyperparameter tuning can take days or even weeks to finish.
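As a rough sketch of why tuning is slow, here is a small grid search with scikit-learn; the algorithm and grid values are arbitrary assumptions, and real grids are usually far larger:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # stand-in training data

# Training is a few lines, but the grid multiplies the cost: 2 x 2 x 2 setting
# combinations x 5 cross-validation folds = 40 separate model fits.
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [2, 4],
        "learning_rate": [0.05, 0.1],
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Every cell in that grid is a full training run, which is why tuning a large dataset can occupy a machine for days.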
Notice how much you have to know about individual algorithms to become a successful data scientist? This is one of the reasons why the field has such a high barrier to entry. Do not be intimidated, but please keep this in mind as we introduce AutoML.
Delivering results to end users
You have now trained your model and tuned its hyperparameters, and you can confidently predict which European players the NBA team should draft. Maybe you have achieved 80% accuracy, maybe 90%, but your predictions will definitely help the business. Despite your results, you still have to get end users to accept your model, trust your model, and use it. Unlike with traditional software, this can require a Herculean effort.
First, end users will want to know why the model gives a particular prediction, and, if you used the wrong algorithm, this is impossible to answer: black-box models use algorithms whose inner workings cannot be readily explained. Then, even if you can give the business explanations, users may still feel uncomfortable with that 80% accuracy number. "What does that mean?", they will ask.
Visualizations are key to relieving some of these fears. For your basketball model, you might simply show the business pictures of the players it recommends drafting, along with some simple graphs showing how many players your model correctly predicted would become NBA stars and how many European NBA stars it failed to predict.
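A sketch of that second kind of graph, assuming you already have true and predicted star labels (the numbers here are invented):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = the player actually became an NBA star
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]  # what the model predicted for those players

# confusion_matrix returns [[tn, fp], [fn, tp]] for binary labels 0 and 1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

plt.bar(["Stars found", "Stars missed"], [tp, fn])
plt.title("How well the model identifies future NBA stars")
plt.ylabel("Number of players")
plt.show()
```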
Putting it all together
You now know what machine learning is, how it differs from traditional software development, and the five steps inherent to any machine learning project. Unfortunately, many people in the industry do not understand any of these things. Most businesses are new to data science. Many businesses believe that data science is much more similar to software development than it is, and this interferes with the machine learning project process.
End users are confused by data scientists' questions because they don't realize that the data scientist is trying to translate their business problem into a machine learning problem. IT is confused as to why data scientists ask for access to so much data because they don't realize that data scientists don't know what data they will need before trying it out. Management is confused as to why their data scientists spend so little time building models and so much time cleansing and transforming data.
Thus, steps 1 and 2 of the machine learning process often take longer than expected. Business users fail to communicate their business problem to data scientists in a useful way, IT is slow to grant data scientists access to data, and data scientists struggle with understanding the data they receive. Step 5 is also complicated because end users expect models to be perfectly explainable like a typical software program, and earning their trust takes time.
Given that misunderstanding slows down the other steps, the rest of the data science process must be fast for companies to see ROI. Transforming data and training models is the core of data science work, after all; it is exactly what data scientists were trained to do, and it should be fast. As we shall see in the next section, this is rarely the case.