Using Regular Expressions and PDFs
So far, we have explored some of the core Python libraries for web communication, content reading, and browser automation, and used them to find and extract data.
Regular expressions (also written as Regex, regex, or RegEx; we will use regex throughout the rest of this chapter) are patterns built from a predefined set of characters and used for searching and similar activities. In Chapters 3 and 4, when carrying out web scraping, we tested and applied various available features, such as CSS selectors, XPath, and PyQuery, to find and locate specific types of content. Regex helps us with pattern matching, and we use it, knowingly or unknowingly, most of the time while working on documents or any textual content.
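As a first taste of pattern matching, here is a minimal sketch using Python's built-in re module. The sample text and both patterns are our own illustrative assumptions, and the email pattern is deliberately simplified, so treat it as a demonstration rather than a production-grade validator.

```python
import re

# Hypothetical sample text, made up for this illustration
text = "Contact: info@example.com, sales@example.org; phone 555-0100."

# Simplified email pattern (an assumption, not a full validator):
# one or more name characters, '@', a domain label, a dot, and a suffix
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.\w+")

# findall() returns every non-overlapping match as a list of strings
emails = email_pattern.findall(text)

# search() returns the first match object, or None if nothing matches
phone = re.search(r"\d{3}-\d{4}", text)

print(emails)
print(phone.group() if phone else "no phone number found")
```

Here, findall() collects both email-like strings from the text, while search() stops at the first run of three digits, a hyphen, and four digits. Compiling the pattern with re.compile() is optional, but it is convenient when the same pattern is reused many times.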
In a data-related context, it is very hard to avoid activities such as finding, searching, and matching. Regex provides us with a simple and elegant approach to dealing with such...