Looking for patterns
Creating a good regular expression is a bit of a design process. A regular expression that is too rigid may miss some of the strings it should match. On the other hand, a regular expression that is not specific enough may match a large number of strings incorrectly.
The key is to look for a well-defined pattern in the data that easily distinguishes the correct matches from the incorrect ones. A helpful first step is usually to look through the data itself. This gives you an intuitive sense of which patterns exist and how often they occur.
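To make the trade-off concrete, here is a minimal sketch that compares three patterns on a handful of strings. The sample strings and all three patterns are invented for illustration and are not taken from the dataset; they only show how a pattern can be too rigid, too loose, or somewhere in between.

```python
import re

# Hypothetical sample strings, invented for illustration (not from the dataset).
samples = [
    "123 Main St",      # looks like an address
    "45 Oak Avenue",    # looks like an address
    "7 Elm Rd",         # looks like an address
    "Room 12",          # not an address
    "call 555 now",     # not an address
]

# Too rigid: exactly three digits, one word, then the literal suffix "St".
too_rigid = re.compile(r"^\d{3} \w+ St$")

# Too loose: any string that contains a digit anywhere.
too_loose = re.compile(r"\d")

# A middle ground (still an assumption about what addresses look like):
# a house number, one or more words, then one of a few street suffixes.
balanced = re.compile(r"^\d+ (?:\w+ )+(?:St|Rd|Avenue)$")


def matches(pattern, strings):
    """Return the strings in which the pattern is found."""
    return [s for s in strings if pattern.search(s)]


rigid_hits = matches(too_rigid, samples)
loose_hits = matches(too_loose, samples)
balanced_hits = matches(balanced, samples)

print("too rigid:", rigid_hits)
print("too loose:", loose_hits)
print("balanced: ", balanced_hits)
```

On these samples the rigid pattern finds only one of the three address-like strings, the loose pattern accepts all five strings, and the middle pattern accepts just the three address-like ones. Real data will contain suffixes and formats this sketch does not anticipate, which is why looking through the data first matters.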
The following Python script uses pandas to read the dataset into a DataFrame, extract the address column, and print a random sample of 100 addresses using the pandas Series.sample() method. A random seed of 0 is used so that the printout is the same on every run. The script is available in the external resources as explore_addresses.py.
import...