Solving our initial challenge
We now put everything together and demonstrate our system on the following new post, which we assign to the variable new_post:
Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.
As we have learned previously, we will first have to vectorize this post before we predict its label as follows:
>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]
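To make the two steps concrete, here is a minimal, self-contained sketch. It assumes scikit-learn's TfidfVectorizer and KMeans; the four-post corpus, the cluster count, and the random_state value are hypothetical stand-ins, not the dataset or settings used in this chapter.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real collection of posts.
posts = [
    "My hard disk makes noise and does not boot",
    "Formatting the disk drive failed again",
    "The graphics card renders images slowly",
    "Which graphics driver renders 3D images best",
]

# Learn the vocabulary and weights from the corpus, then cluster it.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)
km = KMeans(n_clusters=2, n_init=10, random_state=3)
km.fit(X)

new_post = "Disk drive problems. I tried to format my hard disk, but it doesn't boot."

# Use transform (not fit_transform) so the new post is mapped into the
# feature space that was learned from the corpus.
new_post_vec = vectorizer.transform([new_post])

# predict returns one label per input row; we passed a single row.
new_post_label = km.predict(new_post_vec)[0]
print(new_post_label)
```

Note that the new post is wrapped in a list, because transform expects an iterable of documents rather than a single string.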
Now that we have the clustering, we do not need to compare new_post_vec to all post vectors. Instead, we can focus only on the posts of the same cluster. Let us fetch their indices in the original dataset:
>>> similar_indices = (km.labels_==new_post_label).nonzero()[0]
The comparison in parentheses results in a Boolean array with one entry per post. nonzero returns a tuple holding one index array per dimension, so taking element [0] gives a smaller array containing the indices of the True elements...
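The following small sketch illustrates the masking step in isolation. The labels array is a hypothetical stand-in for km.labels_, and the label value 2 is chosen only for illustration.

```python
import numpy as np

# Hypothetical cluster labels for six posts (stand-in for km.labels_).
labels = np.array([2, 0, 2, 1, 2, 0])
new_post_label = 2

# Element-wise comparison yields a Boolean array of the same length.
mask = labels == new_post_label
print(mask)             # [ True False  True False  True False]

# nonzero returns a tuple with one index array per dimension.
index_tuple = mask.nonzero()

# For a one-dimensional array, the first element holds the indices we want.
similar_indices = index_tuple[0]
print(similar_indices)  # [0 2 4]
```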