Quantifying the existence of patterns
If you look through the addresses output from the previous step, you may notice that not all of them have a street address as outlined earlier.
A common deviation from the street address pattern is the addition of an N or S to a street name. Another deviation is initial street names that contain more than one word:
3649 N Southport Ave 4022 N Mozart St Irving Park 260-300 Osceola Ave S St Paul, MN 55102, USA
103 & 105 Misty Morning Way Savannah, Georgia 1656 Mount Eagle Place Alexandria, Virginia
Yet another deviation is the omission of street numbers:
West Outer Drive Dearborn, Michigan Crown St New Haven, CT, USA
Depending on the project, you will usually need to decide how far to go to capture all of the variations in the data. The more complex the pattern, the more work it will take to capture.
Due to this trade-off, it is helpful to quantify how much of the data is captured by a particular pattern. In the next few subsections, I will walk through the...