Understanding the need for pattern recognition
The simplest way to process the values of text fields is to treat them as categorical variables. In a categorical variable, each data entry takes on one of a fixed set of values. To illustrate working with categorical variables, consider a categorical field such as the US states. If the state of Connecticut, for instance, were to appear in a large enough number of data entries, you might expect to see certain characteristic misspellings, such as the following:
Conecticut
Conneticut
Connetict
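One way to spot such variants is to tally the distinct values that appear in the field. The following is a minimal sketch, assuming the data is a list of dictionaries with a "state" key; the sample records are made up for illustration:

```python
from collections import Counter

# Hypothetical sample records; a real dataset would be loaded from a file.
data = [
    {"state": "Connecticut"},
    {"state": "Conecticut"},
    {"state": "Connecticut"},
    {"state": "Conneticut"},
    {"state": "Connecticut"},
    {"state": "Connetict"},
]

# Count how often each distinct value appears in the field.
state_counts = Counter(entry["state"] for entry in data)

# List values from most to least frequent; rare values are
# candidates for misspellings of a more common value.
for value, count in state_counts.most_common():
    print(value, count)
```

Because a categorical field should only contain a small, known set of values, anything outside that set stands out in a tally like this.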
An easy way to fix all of the misspellings might be to iterate through each of the data entries and check against a list of common misspellings as is done in the following demonstration. Note that the following code sample is just for demonstration purposes and doesn't belong to a particular file:
misspellings = ["Conecticut", "Conneticut", "Connectict"]
for ind in range(len(data)):
    if data[ind]["state"] in misspellings:
        data[ind]["state"] = "Connecticut...
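A self-contained version of the same idea might look like the sketch below. The sample records and the fix_state helper are hypothetical and introduced only for illustration, and the structure of data (a list of dictionaries with a "state" key) is assumed from the demonstration above:

```python
# Hypothetical sample records standing in for a real dataset.
data = [
    {"state": "Conecticut"},
    {"state": "Connecticut"},
    {"state": "Conneticut"},
    {"state": "Vermont"},
]

# Known misspellings, stored in a set for fast membership checks.
misspellings = {"Conecticut", "Conneticut", "Connetict"}


def fix_state(entries, bad_values, correct_value):
    """Replace every known misspelling in the "state" field, in place."""
    for entry in entries:
        if entry["state"] in bad_values:
            entry["state"] = correct_value
    return entries


fix_state(data, misspellings, "Connecticut")
print([entry["state"] for entry in data])
```

Note that a list like this only catches the variants someone thought to include; any misspelling not listed passes through unchanged.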