Using regex to extract data
In the previous sections of this chapter, we explored various aspects of regex, with examples. Regex can be applied to all types of content. Whether it is the right tool depends on an analysis of the content, of extendibility, and of the time and (machine) resources required. This analysis is important when choosing among extraction-related options such as XPath, CSS selectors, and PyQuery.
Important note
It’s often mentioned in the literature that regex should only be applied for data extraction when the content is unstructured, but this is not the case. Regex can be used on any type of content, structured or unstructured.
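As a minimal sketch of this point, the snippet below applies Python's re module to one structured input (an HTML fragment) and one unstructured input (a plain sentence). Both sample strings, the attribute names data-lat and data-lng, and the coordinate values are hypothetical and invented purely for illustration. They are not taken from any real site.

```python
import re

# Hypothetical sample inputs, invented for illustration only.
html_snippet = '<li class="dealer" data-lat="27.70" data-lng="85.32">Dealer A</li>'
free_text = "Dealer A is located at latitude 27.70 and longitude 85.32."

# A signed decimal number such as 27.70 or -85.32.
number = r'(-?\d+\.\d+)'

# Structured content: capture the values of two (assumed) attributes.
attr_pattern = re.compile(r'data-lat="' + number + r'"\s+data-lng="' + number + r'"')

# Unstructured content: capture the numbers that follow the keywords.
text_pattern = re.compile(r'latitude\s+' + number + r'\s+and\s+longitude\s+' + number)

# findall() returns one tuple per match when the pattern has several groups.
structured = attr_pattern.findall(html_snippet)
unstructured = text_pattern.findall(free_text)

print(structured)
print(unstructured)
```

Both patterns capture the same pair of values as strings. Only the text surrounding the capture groups changes with the kind of content being processed.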
To extract data from a scraping point of view, we’ll work through a few examples using regex and look at some of its functionality and properties.
Example 1 – Yamaha dealer information
In this example, we will collect information on motor dealers (more precisely, the dealers’ geolocation) from https://yamaha-moto.cfaomotors...