Training your first ML model
As we start applying AI/ML in marketing, it’s crucial to understand the fundamental steps involved in any data science project. In this section, we will train a first model on the Iris dataset, a classic classification example that is widely used in ML due to its simplicity and informative features. This will give you a hands-on introduction to the end-to-end process of performing AI/ML data analysis within the Jupyter Notebook environment via the following steps:
- Step 1: Importing the necessary libraries
- Step 2: Loading the data
- Step 3: Exploratory data analysis (EDA)
- Step 4: Preparing the data for ML
- Step 5: Training a model
- Step 6: Evaluating the model
Step 1: Importing the necessary libraries
Let’s start by importing all the required libraries using the following code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import precision_score, recall_score, f1_score
Step 2: Loading the data
The Iris dataset contains 150 records of iris flowers, including measurements of their sepals and petals, along with the species of the flower. Scikit-learn provides easy access to this dataset through its load_iris function, and we can load the result into a pandas DataFrame:
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]
iris_df.head()
Once you enter this code into your Jupyter notebook and run it, the output should appear as follows:
Figure 1.9: View of the first 5 rows of the Iris dataset
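Before exploring further, it can help to confirm that the DataFrame matches the description above. The following is a minimal sketch that rebuilds iris_df so it can run on its own; the names n_rows, n_cols, and species_counts are illustrative and not part of the original walkthrough:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame exactly as in Step 2
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]

# Number of records and columns (four measurements plus the species label)
n_rows, n_cols = iris_df.shape
print(f"Rows: {n_rows}, columns: {n_cols}")

# Number of records per species
species_counts = iris_df['species'].value_counts()
print(species_counts)
```

The counts show that the dataset is balanced: each of the three species contributes 50 of the 150 records.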
Step 3: Exploratory data analysis
After loading the dataset, it’s time to dive deeper into the underlying data characteristics:
- We can first understand the structure of our data via the following command:
iris_df.info()  # info() prints its summary directly, so no print() is needed
This gives us the following output:
Figure 1.10: View of the data structure of the Iris dataset
The above result gives us insights into the data structure, including column names and data types, non-null counts (missing values can significantly impact the performance of ML models), and memory usage (useful for managing computing resources, especially when working with large datasets).
- Next, we can visualize the distribution of features to get insights into the nature of the data we’re dealing with. Histograms are graphical representations that summarize the distribution of numerical data by dividing it into intervals or “bins” and displaying how many data points fall into each bin:
iris_df.hist(figsize=(12, 8), bins=20)
plt.suptitle('Feature Distribution')
plt.show()
Figure 1.11: Histograms of the distribution of features in the Iris dataset
The histograms of the Iris dataset’s features give us valuable insights into their characteristics, including their distribution shape (whether they are bell-shaped or skewed), outliers and anomalies (which can significantly affect model performance), and feature separability (if a feature’s values fall into distinct groups of bins with little overlap between them, that feature might be a good predictor of species).
- Lastly, we can utilize scatter plots to visualize pairwise relationships between features and employ pair plots to gain insight into how each feature interacts with others across the different species. We can generate scatter plots for pairs of features via the following:
# Sepal measurements, colored by species
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', hue='species', data=iris_df)
plt.title('Sepal Length vs. Sepal Width')
plt.show()
# Petal measurements, colored by species
sns.scatterplot(x='petal length (cm)', y='petal width (cm)', hue='species', data=iris_df)
plt.title('Petal Length vs. Petal Width')
plt.show()
This gives us the following output:
Figure 1.12: Scatter plots of pairwise relationships between features in different species in the Iris dataset
As the above scatter plots show, petal length and petal width cluster clearly by species, suggesting they are strong predictors for species classification.
Pair plots offer a more comprehensive view by showing scatter plots for every pair of features in the dataset. Additionally, histograms along the diagonal provide distributions for each feature, segmented by species:
sns.pairplot(iris_df, hue='species', diag_kind='hist')  # diag_kind='hist' draws histograms on the diagonal
plt.show()
This yields the following output:
Figure 1.13: Pair plots for the Iris dataset, showing scatter plots for each pair of features and histograms along the diagonal
The above pair plots allow us to quickly identify which features have linear relationships or clear separation between species across multiple dimensions. For instance, the combination of petal length and petal width shows a distinct separation between species, indicating that these features are particularly useful for classification tasks. The histograms along the diagonal help in understanding the distribution of each feature within each species, providing insights into how these distributions can be leveraged for predictive modeling. For instance, if a feature shows a tight, well-defined distribution within a species, it suggests that the feature is a reliable predictor for that species. Conversely, a feature with a wider spread within a species may be less reliable as a predictor.
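The visual impressions above can also be checked numerically. The following is a minimal sketch, not part of the original walkthrough, that summarizes each feature within each species; the names species_summary and petal_length_range are illustrative:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrame exactly as in Step 2
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target_names[iris.target]

# Mean and standard deviation of each feature within each species
species_summary = iris_df.groupby('species').agg(['mean', 'std'])
print(species_summary.round(2))

# Minimum and maximum petal length per species; ranges that do not
# overlap suggest the feature separates those species cleanly
petal_length_range = iris_df.groupby('species')['petal length (cm)'].agg(['min', 'max'])
print(petal_length_range)
```

A small standard deviation relative to the gap between species means corresponds to the tight, well-defined distributions described above. In this dataset, the largest setosa petal length is smaller than the smallest versicolor petal length, which is why setosa is so easy to separate.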
Importance of visual EDA
Visual EDA is a powerful first step in the modeling process. By identifying patterns, clusters, and outliers, we can make informed decisions about feature selection, preprocessing, and the choice of ML models.
Step 4: Preparing the data for ML
The next crucial step is to prepare our data for ML. This process typically involves selecting features for the model, splitting the data into training and testing sets, and sometimes transforming the features to better suit the algorithms you plan to use.
In supervised learning tasks like the one given here, we distinguish between features (independent variables) and the target (dependent variable). In the Iris dataset:
- Features include the measurements: sepal length, sepal width, petal length, and petal width.
- Target is the species of the iris plant.
For our example, we will follow these steps:
- We use all four measurements as features to predict the species of the iris plant, making this a multi-class classification problem:
X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris_df['species']
- To assess the performance of an ML model, we split our dataset into a training set and a testing set. The training set is used to train the model, and the testing set is used to evaluate its performance on unseen data. A common split ratio is 80% for training and 20% for testing. We can easily split our data using the train_test_split function from scikit-learn:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Importance of data splitting
Splitting the data into training and testing sets is a fundamental practice in ML. It helps in evaluating the model’s performance accurately and in checking that it can generalize well to new, unseen data. By training and testing on different sets, we can detect overfitting, where the model performs well on the training data but poorly on new data.
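To make the split concrete, the following minimal sketch reproduces it and inspects the result. The stratify=y variation at the end is an optional addition that the walkthrough does not use, and the variable names ending in _s are illustrative:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target_names[iris.target], name='species')

# Same split as in the walkthrough: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Species counts in the test set; a purely random split need not be balanced
print(y_test.value_counts())

# Optional variation (not used in the walkthrough): stratify=y keeps the
# species proportions the same in the training and testing sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_test_s.value_counts())
```

With 150 records, the 80/20 split yields 120 training and 30 testing records. Stratification matters more when classes are imbalanced, because a random split could otherwise leave a rare class underrepresented in the test set.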
- Some ML algorithms are sensitive to the scale of the data. For example, algorithms that compute distances between data points (like K-nearest neighbors) can be affected by features that are on different scales. Feature scaling can be applied via the following code:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training set only
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics on the test set
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled_df.head()
This gives us the following output:
Figure 1.14: Scaled features for the first 5 rows of the Iris dataset
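Now, each feature’s values are centered around 0 with unit variance. This step is essential for certain ML algorithms that are sensitive to the scale of the data, as it ensures that each feature contributes proportionately to the final model. Decision trees, which we use in the next step, split on one feature at a time and are not sensitive to feature scale, so the model in Step 5 is trained on the unscaled X_train.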
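The effect of scaling can be checked directly. The following minimal sketch, with illustrative names such as train_means and train_stds, recomputes the column statistics after applying StandardScaler; it uses load_iris(return_X_y=True) to obtain plain NumPy arrays rather than the DataFrame from Step 2:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features: mean of about 0 and standard deviation of about 1 per column
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
print("Train means:", np.round(train_means, 3))
print("Train stds: ", np.round(train_stds, 3))

# Test features are scaled with the training statistics, so their
# means are close to 0 but generally not exactly 0
test_means = X_test_scaled.mean(axis=0)
print("Test means: ", np.round(test_means, 3))
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into the preprocessing step.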
Step 5: Training a model
With our data loaded, explored, and prepared, we’re now ready to move on to one of the most exciting parts of an ML project: training a model. For this example, we will use a decision tree classifier, a versatile ML algorithm that works well for classification tasks and is easy to understand and interpret as it mimics human decision-making. The decision tree will help us predict the species of iris plants based on the features we prepared earlier.
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each internal node in the tree tests a feature of the instance being classified (in scikit-learn, by comparing it against a threshold), and each branch corresponds to an outcome of that test. We can train our decision tree classifier using scikit-learn:
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
Once our model is trained, we can use it to make predictions. We will predict the species of iris plants using the features from our testing set:
y_pred = dt_classifier.predict(X_test)
print("First few predictions:", y_pred[:5])
The following is the output:
First few predictions: ['versicolor' 'setosa' 'virginica' 'versicolor' 'versicolor']
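The same predict call works for measurements that are not in the dataset. The following minimal sketch classifies one hypothetical flower; the values in new_flower are made up for illustration and chosen to resemble a typical setosa, and predict_proba is used to show the class probabilities behind the prediction:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target_names[iris.target], name='species')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Hypothetical measurements (in cm) for one new flower; the values are made up
new_flower = pd.DataFrame([[5.0, 3.4, 1.5, 0.2]], columns=iris.feature_names)

predicted_species = dt_classifier.predict(new_flower)[0]
class_probabilities = dt_classifier.predict_proba(new_flower)[0]

print("Predicted species:", predicted_species)
for name, prob in zip(dt_classifier.classes_, class_probabilities):
    print(f"  {name}: {prob:.2f}")
```

Passing the new measurements as a DataFrame with the same column names as the training data keeps the feature order unambiguous.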
Why use a decision tree?
Decision trees are a popular choice for classification tasks because they don’t require much data preparation, are easy to interpret and visualize, and can in principle handle both numerical and categorical data (scikit-learn’s implementation expects categorical features to be encoded as numbers first). For beginners in ML, decision trees offer a clear and intuitive way to understand the basics of model training and prediction.
Visualizing the decision tree can provide insight into how the model makes its decisions:
plt.figure(figsize=(20,10))
plot_tree(dt_classifier, filled=True, feature_names=iris.feature_names, class_names=iris.target_names.tolist())
plt.show()
This gives us the following output:
Figure 1.15: Visualization of a decision tree classifier built on the Iris dataset
The above visualization shows the splits that the tree makes on the features, the criteria for these splits, and the eventual leaf nodes where the final predictions are made based on the majority class from the training samples that fall into that leaf. For those new to ML, seeing this process can clarify how a seemingly simple algorithm can effectively classify instances. Visualizing your model can also highlight areas where the model might be overfitting by creating overly complex decision paths.
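Tree complexity can also be inspected without a plot. The following minimal sketch compares the unrestricted tree with a shallower one and prints the latter's rules as text. The max_depth=2 setting is an arbitrary illustrative choice, not a recommendation, and the sketch uses the integer labels in iris.target rather than the species names:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Unrestricted tree, configured as in the walkthrough
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Full tree  - depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())

# Shallower tree for comparison; max_depth=2 is an arbitrary illustrative choice
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)
print("Small tree - depth:", small_tree.get_depth(), "leaves:", small_tree.get_n_leaves())

# Text rendering of the smaller tree's decision rules
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Limiting the depth is one common way to keep decision paths simple. Whether it helps or hurts depends on the data, so it is worth comparing test-set performance before and after.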
Step 6: Evaluating the model
The last step is to compare the model’s predictions on our test set with the actual species and evaluate the model’s performance. This step is crucial as it helps us understand how well our model can generalize to unseen data.
Using the trained decision tree classifier, we can now calculate precision, recall, and the F1-score. Let’s look at what exactly these are:
- Precision measures the accuracy of the positive predictions. It is the ratio of true positive predictions to the total positive predictions (including both true positives and false positives). High precision indicates that the model is reliable in its positive predictions.
- Recall (or sensitivity) measures the ability of the model to capture all actual positives. It is the ratio of true positive predictions to the total actual positives (including both true positives and false negatives). High recall means that the model is good at capturing positive instances without missing many.
- The F1-score is the harmonic mean of precision and recall, providing a single metric to assess the balance between them. An F1-score reaches its best value at 1 (perfect precision and recall) and worst at 0.
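The three definitions above can be worked through by hand on a small example. The following minimal sketch uses made-up binary labels (they are not Iris results) and checks the manual calculation against scikit-learn; y_true and y_hat are illustrative names:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# Count the outcomes by hand
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")

# scikit-learn's functions should agree with the manual calculation
sk_precision = precision_score(y_true, y_hat)
sk_recall = recall_score(y_true, y_hat)
sk_f1 = f1_score(y_true, y_hat)
print(f"scikit-learn: {sk_precision:.2f}, {sk_recall:.2f}, {sk_f1:.2f}")
```

Here the model makes four positive predictions, of which three are correct (precision 0.75), and finds three of the five actual positives (recall 0.60). The F1-score of about 0.67 sits between the two, closer to the lower value.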
We’ll use scikit-learn’s built-in functions to calculate these metrics by comparing the predictions with the actual species in the test set. Because this is a multi-class problem, we pass average='macro', which computes each metric per species and then takes their unweighted mean:
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
This gives us the following output:
Precision: 1.00
Recall: 1.00
F1-Score: 1.00
Achieving a score of 1.0 in precision, recall, and F1-score is exceptional and indicates perfect model performance on the test set. However, in real-world scenarios, especially with complex and noisy data, such perfect scores are rare and should be approached with caution, as they may not reflect the model’s ability to generalize to unseen data. The Iris dataset is relatively small (our test set holds only 30 records) and well-structured, with clear distinctions between the classes. This simplicity makes it easier to achieve high performance metrics than with real-world datasets, for which model training and evaluation are typically more complex. As we will discuss in future chapters, further output diagnostics such as the confusion matrix can also be used to gain insight into model strengths and weaknesses.
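As a brief preview of the confusion matrix mentioned above, the following minimal sketch tabulates actual against predicted species for the same model and split. It is an illustration rather than the treatment given in later chapters, and the row and column labels in cm_df are illustrative:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target_names[iris.target], name='species')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dt_classifier = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

# Rows are actual species, columns are predicted species
labels = list(iris.target_names)
cm = confusion_matrix(y_test, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in labels],
    columns=[f"predicted {name}" for name in labels],
)
print(cm_df)
```

Counts on the diagonal are correct predictions, and any off-diagonal count shows which species was mistaken for which. This is detail that a single averaged score hides.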
Understanding model performance
Model accuracy, the fraction of predictions that are correct, is a vital metric in assessing the effectiveness of an ML model. An accuracy score close to 1.0 indicates a high level of correct predictions. However, it’s also important to consider other metrics like precision, recall, and the confusion matrix for a more comprehensive evaluation, especially in datasets with imbalanced classes, where a model can score high accuracy simply by predicting the majority class.
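The caution about imbalanced classes can be illustrated with a tiny example. The following minimal sketch uses made-up labels (unrelated to Iris) and a trivial "model" that always predicts the majority class; y_true and y_hat are illustrative names:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced labels: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class
y_hat = [0] * 100

accuracy = accuracy_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)  # share of actual positives that were found

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall:   {recall:.2f}")
```

The accuracy of 0.95 looks strong, yet the recall of 0.00 shows that not a single positive case was identified. In a marketing setting, the positive class is often the rare event of interest, so accuracy alone would be misleading.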
Congratulations! By completing these steps, from loading and exploring the data to training a model, making predictions, and evaluating its performance, you’ve covered the essential workflow of an ML project. The skills and concepts you’ve practiced here are directly applicable to the exercises you will perform in the following chapters towards creating effective, data-driven marketing campaigns.