String parsing with regular expressions
How do we decompose a complex string? What if we have complex, tricky punctuation? Or—worse yet—what if we don't have punctuation, but have to rely on patterns of digits to locate meaningful information?
Getting ready
The easiest way to decompose a complex string is by generalizing the string into a pattern and then writing a regular expression that describes that pattern.
There are limits to the patterns that regular expressions can describe. Deeply nested documents in a language like HTML, XML, or JSON are the usual trouble spot: a regular expression can't reliably match arbitrarily nested structure, so a dedicated parser is the better tool there.
The re module contains all of the various classes and functions we need to create and use regular expressions.
Let's say that we want to decompose text from a recipe website. Each line looks like this:
>>> ingredient = "Kumquat: 2 cups"
We want to separate the ingredient...
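As a minimal sketch of where this is heading, the example line can be generalized into three parts: a word, a colon, then a number followed by a unit. The pattern below is one possible reading of that shape, not necessarily the one this recipe settles on. The group names ingredient, amount, and unit, and the variable ingredient_pattern, are assumptions chosen for illustration.

```python
import re

# Assumed shape of a line: "<name>: <whole number> <unit>".
# The group names are illustrative, not taken from the original text.
ingredient_pattern = re.compile(
    r"(?P<ingredient>\w+):\s+(?P<amount>\d+)\s+(?P<unit>\w+)"
)

ingredient = "Kumquat: 2 cups"

# match() anchors at the start of the string and returns None on failure.
match = ingredient_pattern.match(ingredient)
if match is not None:
    name = match.group("ingredient")
    amount = int(match.group("amount"))
    unit = match.group("unit")
    print(name, amount, unit)
```

This sketch only handles single-word ingredient names and whole-number amounts. Lines such as "Brown sugar: 1.5 cups" would need a looser pattern.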