Measuring text similarity with Cosine Similarity measure using Java 8
Data scientists often measure the distance or similarity between two data points--sometimes for classification or clustering, sometimes for detecting outliers, and for many other cases. When they deal with texts as data points, the traditional distance or similarity measurements cannot be used. There are many standard and classic as well as emerging and novel similarity measures available for comparing two or more text data points. In this recipe, we will be using a measurement named Cosine Similarity to compute distance between two sentences. Cosine Similarity is considered to be a de facto standard in the information retrieval community and therefore widely used. In this recipe, we will use this measurement to find the similarity between two sentences in string format.
Getting ready
Although the readers can get a comprehensive outlook of the measurement from https://en.wikipedia.org/wiki/Cosine_similarity, let us see the...