Finding similar strings – Levenshtein distance
When doing information extraction, we often have to deal with misspellings, which complicate the task. Several methods are available to get around this problem, including Levenshtein distance. This algorithm finds the minimum number of single-character substitutions, insertions, and deletions needed to change one string into another. For example, to change the word put into pat, you need to substitute a for u, and that is one change. To change the word kitten into smitten, you need to make two edits: change k into m and add an s at the start.
In this recipe, you will use this technique to find a match for a misspelled email.
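To make the idea concrete before we start, here is a minimal sketch of the distance calculation in plain Python. It is not the implementation used by the python-Levenshtein package; it is a simple dynamic programming version written so that it runs without any extra dependencies. The function names levenshtein_distance and closest_match and the email addresses are hypothetical and used for illustration only.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # previous_row[j] is the distance between the already processed
    # prefix of source and the first j characters of target.
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        # Turning the first i characters of source into an empty
        # string takes i deletions.
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            # A substitution costs nothing when the characters match.
            substitution = previous_row[j - 1] + (source_char != target_char)
            current_row.append(min(insertion, deletion, substitution))
        previous_row = current_row
    return previous_row[-1]


def closest_match(query: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest distance to query."""
    return min(candidates, key=lambda c: levenshtein_distance(query, c))


# Hypothetical addresses, used only to illustrate the idea.
known_emails = ["anna.smith@example.com", "john.doe@example.com"]

print(levenshtein_distance("put", "pat"))         # 1
print(levenshtein_distance("kitten", "smitten"))  # 2
print(closest_match("jon.doe@example.com", known_emails))  # john.doe@example.com
```

The sketch keeps only two rows of the distance table in memory at a time, which is enough because each cell depends only on the current row and the one before it.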
Getting ready
We will use the same packages and the data scientist job description dataset that we used in the previous recipe, and the python-Levenshtein package, which is part of the Poetry environment and is included in the requirements.txt file.
The notebook is located at https://github.com/PacktPublishing...