Finding similar strings: the Levenshtein distance
When extracting information, we often run into misspellings, which complicate the task. Several methods can work around this problem, including the Levenshtein distance. This algorithm counts the minimum number of single-character insertions, deletions, and substitutions needed to change one string into another. In this recipe, you will use this technique to find a match for a misspelled email.
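To make the definition concrete, here is a minimal pure-Python sketch of the distance computation using the standard dynamic-programming approach. The function name levenshtein_distance and the sample strings are illustrative assumptions. The recipe itself relies on the python-Levenshtein package rather than on this code.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn source into target."""
    # previous_row[j] is the distance between the empty string
    # and the first j characters of target.
    previous_row = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions.
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete source_char
                current_row[j - 1] + 1,                   # insert target_char
                previous_row[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous_row = current_row
    return previous_row[-1]


print(levenshtein_distance("kitten", "sitting"))      # 3
print(levenshtein_distance("gmail.com", "gmial.com")) # 2
```

Note that two swapped characters count as two edits, because the plain Levenshtein distance has no transposition operation.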
Getting ready
We will use the same packages and dataset that we used in the previous recipe, and also the python-Levenshtein package, which can be installed using the following command:
pip install python-Levenshtein
How to do it…
We will read the dataset into a pandas DataFrame and use the emails extracted from it to search for a misspelled email.
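Before going through the steps, the overall idea can be sketched as follows: compute the distance from the misspelled email to every known email and keep the one with the smallest distance. This is only a sketch under stated assumptions. The list known_emails stands in for the emails extracted from the dataset, the addresses are made up, and find_closest_email is a hypothetical helper name. The distance function comes from python-Levenshtein, with a small pure-Python fallback so that the sketch still runs if the package is not installed.

```python
try:
    # Provided by the python-Levenshtein package.
    from Levenshtein import distance
except ImportError:
    def distance(a: str, b: str) -> int:
        """Fallback edit distance (insertions, deletions, substitutions)."""
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, start=1):
            current = [i]
            for j, char_b in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                       # deletion
                    current[j - 1] + 1,                    # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                ))
            previous = current
        return previous[-1]


def find_closest_email(known_emails, misspelled_email):
    """Return (closest_email, distance) for the given misspelled email."""
    closest = min(known_emails, key=lambda email: distance(email, misspelled_email))
    return closest, distance(closest, misspelled_email)


# Hypothetical stand-in for the emails extracted from the dataset.
known_emails = [
    "anna.smith@example.com",
    "john.doe@example.com",
    "hr@example.org",
]

print(find_closest_email(known_emails, "jonh.doe@example.com"))
# ('john.doe@example.com', 2)
```

If several emails are equally close, min returns the first one it encounters, so in practice it may be worth adding a maximum acceptable distance before treating the result as a match.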
- Do the necessary imports:
import pandas as pd
import Levenshtein
from Chapter05.regex import get_emails...