Considerations for ML
Now that you’ve created a preliminary data model that will serve as the basis for analytic reporting in Power BI, you start thinking about a process for creating tables of data to be used with Power BI machine learning. You will need to create a single table of flattened data for each machine learning model that you train, test, and deploy.
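To make the idea of a single flattened table concrete, here is a minimal Python sketch of one common reading of "flattening": collapsing a one-to-many relationship into one row per entity. The table names, column names, and values (`incidents`, `damaged_parts`, `incident_id`, and so on) are hypothetical and chosen only for illustration. In Power BI, you would do this kind of shaping in Power Query rather than in Python.

```python
from collections import Counter

# Hypothetical parent table: one row per incident.
incidents = [
    {"incident_id": 1, "airport": "AAA"},
    {"incident_id": 2, "airport": "BBB"},
]

# Hypothetical child table: zero or more rows per incident.
damaged_parts = [
    {"incident_id": 1, "part": "engine"},
    {"incident_id": 1, "part": "wing"},
]


def flatten(parents, children, key, count_column):
    """Return one row per parent with the number of matching child rows."""
    counts = Counter(child[key] for child in children)
    # Counter returns 0 for keys it has never seen, so parents
    # without children get a count of 0.
    return [{**parent, count_column: counts[parent[key]]} for parent in parents]


flat = flatten(incidents, damaged_parts, "incident_id", "damaged_part_count")
for row in flat:
    print(row)
```

Each row of `flat` now describes exactly one incident, which is the shape a training table needs. Counting child rows is only one way to summarize them; which aggregation is appropriate depends on the data.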
Creating tables of data to train a machine learning model entails treating each column as a feature of the algorithm that you will be training and then using to make predictions. For example, if you wanted to create a machine learning algorithm that predicts whether something is an insect, the features (ML terminology for columns on a single table) might be [Six Legs Y/N?], [Life Form Y/N?], [Count of Eyes], and [Weight], and then a column that will be predicted, such as [Insect Y/N?]. Each row would represent something that is being evaluated for a prediction to answer the question, “Is this an insect?”
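As a rough illustration of the insect example, the following Python sketch represents a tiny training table as a list of rows and separates the feature columns from the column to be predicted. The column names come from the example above. The row values are invented for illustration, the unit of [Weight] is not specified, and the helper `split_features_and_label` is a hypothetical name, not part of Power BI.

```python
# Invented example rows; each row is one thing being evaluated.
rows = [
    {"Six Legs Y/N?": "Y", "Life Form Y/N?": "Y", "Count of Eyes": 2, "Weight": 0.01, "Insect Y/N?": "Y"},
    {"Six Legs Y/N?": "N", "Life Form Y/N?": "Y", "Count of Eyes": 8, "Weight": 0.5, "Insect Y/N?": "N"},
    {"Six Legs Y/N?": "N", "Life Form Y/N?": "N", "Count of Eyes": 0, "Weight": 100.0, "Insect Y/N?": "N"},
]

LABEL = "Insect Y/N?"  # the column that will be predicted


def split_features_and_label(table, label_column):
    """Separate the feature columns from the label column, row by row."""
    features = [
        {name: value for name, value in row.items() if name != label_column}
        for row in table
    ]
    labels = [row[label_column] for row in table]
    return features, labels


features, labels = split_features_and_label(rows, LABEL)
feature_names = list(features[0].keys())

print("Features:", feature_names)
print("Labels:", labels)
```

The model is trained on `features` to reproduce `labels`; at prediction time, only the feature columns are available and the label is what the model returns.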
You decide to take the following approach, in this order, so that you can do everything within Power BI:
- Explore the data and create the initial data model in Power Query in Power BI Desktop.
- Create the analytic report in Power BI.
- Discover features in Power BI.
- Create training data sets in Power Query.
- Move the training data sets to Power BI dataflows.
- Train, test, and deploy a Power BI machine learning model in Power BI dataflows.
This process is shown in Figure 1.22.
Figure 1.22 – All of the ETL (extract, transform, load) will happen in Power BI Power Query and Power BI dataflows
Power BI ML offers three types of predictive models. As defined in the Power BI service, they are as follows:
- A binary prediction model predicts whether an outcome will be achieved. Effectively, a prediction of “Yes” or “No” is returned.
- A general classification model predicts more than two possible outcomes, such as A, B, C, or D.
- A regression model will predict a numeric value along a spectrum of possible values. For example, it will predict the costs of an event based on similar past events.
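The distinction between the three model types comes down to the kind of values in the column being predicted. The Python sketch below captures that as a simple rule of thumb. The function `suggest_model_type` is a hypothetical helper written for this illustration; it is not part of Power BI, which asks you to choose the model type yourself.

```python
def suggest_model_type(target_values):
    """Suggest a model type from the values of the column to be predicted.

    This is a heuristic sketch only: two distinct outcomes suggest a binary
    prediction model, numeric values suggest a regression model, and
    anything else suggests a general classification model.
    """
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("The target column needs at least two distinct values.")
    if len(distinct) == 2:
        return "binary prediction"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct):
        return "regression"
    return "general classification"


print(suggest_model_type(["Yes", "No", "Yes"]))    # two outcomes
print(suggest_model_type(["A", "B", "C", "D"]))    # more than two categories
print(suggest_model_type([120.0, 850.5, 40.0]))    # numeric values
```

A rule like this is only a starting point. A numeric column with a handful of distinct codes, for example, may be better treated as a classification target than as a regression target.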
As part of your preliminary planning, you consider how these options could map to the deliverables that were prioritized by your stakeholders:
- Analytic report: This deliverable will be a Power BI analytic report and could use some Power BI AI features, but it will not be a Power BI ML model. The analytic report will help you explore and identify the right data for Power BI machine learning models.
- Predict damage: Predicting whether or not damage will result from a wildlife strike is a good match for a binary prediction model since the answer will have two possible outcomes: yes or no.
- Predict size: Predicting the size of the wildlife that struck an aircraft, based on factors such as damage cost, damage location, height, time of year, and airport location, will probably involve several possible outcomes, such as Large, Medium, and Small. This requirement could be a good fit for a general classification model.
- Predict height: This deliverable predicts the height at which wildlife strikes will happen and provides that prediction as a numeric value representing height above ground level in feet. It is likely a good fit for a regression model, which predicts numeric values.
You cannot know in advance whether the FAA Wildlife Strike data will support these specific use cases; you won’t know until you try! Discovery is a key part of the process. First, you must identify features in the data that might have predictive value, and then train and test the machine learning models in Power BI. Only then will you know what types of predictions might be possible for your project.