To illustrate the life cycle and challenges of building a learning algorithm for specific data, let us consider a real example. The Nature Conservancy is working with fishing companies and other partners to monitor fishing activities and preserve fisheries for the future, and it is looking to use cameras to scale up this monitoring process. The amount of data produced by these cameras will be cumbersome and very expensive to process manually, so the conservancy wants to develop a learning algorithm that automatically detects and classifies different species of fish in order to speed up the video review process.
Figure 1.1 shows a sample of images taken by conservancy-deployed cameras. These images will be used to build the system.
Figure 1.1: Sample of the conservancy-deployed cameras' output
Our aim in this example is to separate the different species, such as tuna and shark, that fishing boats catch. As an illustrative example, we can limit the problem to only two classes: tuna and opah.
Figure 1.2: Tuna fish type (left) and opah fish type (right)
After limiting our problem to only two types of fish, we can take a random sample of images from our collection and start to note some physical differences between the two types. Consider, for example, the following:
- Length: You can see that compared to the opah fish, the tuna fish is longer
- Width: Opah is wider than tuna
- Color: You can see that the opah fish tends to be more red while the tuna fish tends to be blue and white, and so on
We can use these physical differences as features that help our learning algorithm (classifier) differentiate between the two types of fish.
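To make this concrete, here is a minimal sketch of how such features might be represented; the feature names and values are invented for illustration:

```python
# A minimal sketch: each fish image is summarized by a vector of
# explanatory features. The names and values here are invented.
fish_sample = {
    "length": 8.2,   # measured length of the fish
    "width": 2.6,    # measured width of the fish
    "redness": 0.2,  # average red intensity of the fish pixels
}

# In this setup the classifier never sees the raw image, only this
# numeric summary, so the features must capture the differences
# that we can observe visually.
feature_vector = [fish_sample["length"], fish_sample["width"], fish_sample["redness"]]
print(feature_vector)  # [8.2, 2.6, 0.2]
```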
Explanatory features are what we use in daily life to discriminate between the objects that surround us; even babies use them to learn about their environment. The same applies to data science: in order to build a learned model that can discriminate between different objects (for example, fish types), we need to give it some explanatory features to learn from (for example, fish length). To make the model more certain and to reduce confusion errors, we can increase (to some extent) the number of explanatory features.
Given the physical differences between the two types of fish, the two fish populations have different models or descriptions. The ultimate goal of our classification task is therefore to get the classifier to learn these models; then, given an image of one of the two types as input, the classifier will classify it by choosing the model (the tuna model or the opah model) that best corresponds to the image.
In this case, the collection of tuna and opah images will act as the knowledge base for our classifier. Initially, the knowledge base (the training samples) will be labeled/tagged: for each image, you will know beforehand whether it is a tuna or an opah. The classifier will use these training samples to model the two types of fish, and we can then use the output of the training phase to automatically label unlabeled/untagged fish that the classifier did not see during training. This kind of unlabeled data is often called unseen data. The training phase of the life cycle is shown in the following diagram:
Supervised data science is all about learning from historical data with a known target or output, such as the fish type, and then using the learned model to predict cases or data samples for which we don't know the target/output.
Figure 1.3: Training phase life cycle
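To make the supervised setting concrete, here is a minimal sketch, assuming scikit-learn is available; the [length, width] measurements and labels below are invented:

```python
# A minimal sketch of supervised learning, assuming scikit-learn.
# The [length, width] measurements and labels below are invented.
from sklearn.neighbors import KNeighborsClassifier

# Historical data with a known target (the fish type).
X_train = [[8.0, 2.5], [9.1, 2.8], [4.2, 5.0], [3.9, 5.3]]
y_train = ["tuna", "tuna", "opah", "opah"]

# Train on the labeled knowledge base.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Predict the label of an unseen sample the model was not trained on.
print(model.predict([[8.5, 2.6]]))  # ['tuna']
```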
Let's have a look at how the training phase of the classifier will work:
- Pre-processing: In this step, we will try to segment the fish out of the image using a relevant segmentation technique.
- Feature extraction: After segmenting the fish from the image by subtracting the background, we will measure the physical differences (length, width, color, and so on) in each image. In the end, you will get something like Figure 1.4.
Finally, we will feed this data into the classifier in order to model different fish types.
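As a rough sketch of these two steps, assuming OpenCV 4 is available and a fish that contrasts clearly with the background (real conservancy footage would need a far more robust segmentation technique, and the file name fish.jpg is hypothetical):

```python
# A rough sketch of pre-processing and feature extraction, assuming
# OpenCV 4 and a dark fish against a lighter background. The file
# name "fish.jpg" is hypothetical.
import cv2

image = cv2.imread("fish.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Pre-processing: segment the fish from the background with Otsu thresholding.
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
fish = max(contours, key=cv2.contourArea)  # assume the largest blob is the fish

# Feature extraction: measure the physical differences we proposed.
x, y, w, h = cv2.boundingRect(fish)
length, width = max(w, h), min(w, h)  # in pixels
mean_b, mean_g, mean_r, _ = cv2.mean(image, mask=mask)

print({"length": length, "width": width, "redness": mean_r})
```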
As we have seen, we can visually differentiate between tuna and opah fish based on the physical differences (features) that we proposed, such as length, width, and color.
We can use the length feature to differentiate between the two types of fish. So we can try to differentiate between the fish by observing their length and seeing whether it exceeds some value (length*) or not.
So, based on our training sample, we can derive the following rule:
If length(fish) > length* then label(fish) = Tuna
Otherwise label(fish) = Opah
In order to find this length*, we can make length measurements over our training samples. Suppose we obtain these length measurements and plot the histogram as follows:
Figure 1.4: Histogram of the length measurements for the two types of fish
In this case, we can derive a rule based on the length feature to differentiate between tuna and opah. In this particular example, we can tell that length* is 7, so we can update the preceding rule to:
If length(fish) > 7 then label(fish) = Tuna
Otherwise label(fish) = Opah
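Here is a minimal sketch of this rule in Python, with invented measurements standing in for the histogram in Figure 1.4, and with length* chosen simply as the midpoint between the two class means:

```python
# A minimal sketch of the length-based rule. The measurements are
# invented stand-ins for the histogram in Figure 1.4.
tuna_lengths = [7.7, 8.3, 8.9, 9.4, 7.2]
opah_lengths = [4.2, 5.0, 5.6, 6.3, 7.4]

# One simple way to pick length*: the midpoint between the class means.
length_star = (sum(tuna_lengths) / len(tuna_lengths)
               + sum(opah_lengths) / len(opah_lengths)) / 2

def label(fish_length, threshold=length_star):
    return "Tuna" if fish_length > threshold else "Opah"

print(round(length_star, 1))   # 7.0, matching length* in the text
print(label(8.3), label(5.1))  # Tuna Opah
```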
As you may notice, this is not a promising result because of the overlap between the two histograms; the length feature is not perfect enough to use on its own to differentiate between the two types. So we can try to incorporate more features, such as the width, and then combine them. If we somehow manage to measure the width of our training samples, we might get a histogram like the following:
Figure 1.5: Histogram of width measurements for the two types of fish
As you can see, depending on one feature only will not give accurate results, and the resulting model will make lots of misclassifications. Instead, we can combine the two features and come up with something that looks reasonable.
So if we combine both features, we might get something that looks like the following graph:
Figure 1.6: Combination of a subset of the length and width measurements for the two types of fish
Combining the readings for the length and width features, we get a scatter plot like the one in the preceding graph, with red dots representing tuna and green dots representing opah. We can propose the black line as the rule, or decision boundary, that differentiates between the two types of fish.
For example, if a fish's reading falls above this decision boundary, it is a tuna; otherwise, it will be predicted as an opah.
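Here is a minimal sketch of learning such a linear decision boundary, assuming scikit-learn; the [length, width] pairs are invented to mimic the scatter plot in Figure 1.6:

```python
# A minimal sketch of a linear decision boundary over two features,
# assuming scikit-learn. The [length, width] pairs are invented to
# mimic the scatter plot in Figure 1.6.
from sklearn.linear_model import LogisticRegression

X = [[8.1, 2.4], [8.9, 2.7], [7.6, 2.9], [9.2, 2.3],  # tuna
     [4.3, 5.1], [5.0, 5.6], [5.8, 4.9], [4.7, 5.4]]  # opah
y = ["Tuna"] * 4 + ["Opah"] * 4

clf = LogisticRegression()
clf.fit(X, y)

# The learned weights define the line (decision boundary): points on
# one side are predicted as tuna, points on the other as opah.
print(clf.predict([[8.0, 2.5], [5.0, 5.0]]))  # ['Tuna' 'Opah']
```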
We can somehow try to increase the complexity of the rule to avoid any errors and get a decision boundary like the one in the following graph:
Figure 1.7: Increasing the complexity of the decision boundary to avoid misclassifications over the training data
The advantage of this model is that we get almost zero misclassifications over the training samples. But this is actually not the objective of data science. The objective of data science is to build a model that will generalize and perform well over unseen data. In order to find out whether we have built a model that generalizes, we introduce a new phase called the testing phase, in which we give the trained model an unlabeled image and expect it to assign the correct label (tuna or opah).
Data science's ultimate objective is to build a model that will work well in production, not over the training set. So don't be happy when you see your model performing well on the training set, like the one in Figure 1.7; mostly, this kind of model will fail to recognize the fish type in new images. Having a model that works well only over the training set is called overfitting, and most practitioners fall into this trap.
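The trap is easy to demonstrate on synthetic data, assuming scikit-learn: an unconstrained decision tree nearly memorizes a noisy training set yet scores worse than a shallower tree on held-out data:

```python
# A small sketch of the overfitting trap, assuming scikit-learn.
# The data is synthetic, with flip_y adding label noise.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 2):  # None lets the tree grow until ~0 training error
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```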
Instead of coming up with such a complex model, you can derive a less complex one that will generalize in the testing phase. The following graph shows the use of a less complex model in order to get fewer misclassification errors and to generalize over the unseen data as well:
Figure 1.8: Using a less complex model in order to be able to generalize over the testing samples (unseen data)