Overlapping groups
Throughout Chapter 2, Regular Expressions with Python, we've seen several operations that came with a warning about overlapping groups: for example, the findall operation. This is something that seems to confuse a lot of people. So, let's try to bring some clarity with a simple example:
>>> re.findall(r'(a|b)+', 'abaca')
['a', 'a']
What's happening here? Why does the preceding expression give us 'a' and 'a' instead of 'aba' and 'a'?
Let's look at it step by step to understand the solution:
As we can see in the preceding figure, the characters aba are matched, but the captured group contains only a. This is because, even though our regex runs every character through the group, the group keeps only the last one it matched: the final a. Keep this in mind because it's the key to understanding how it works. Stop for a moment and think about it: we're asking the regex engine to capture all the groups made up of a or b, but the group itself covers just one character at a time, and that's the key...
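To make the difference between what is matched and what is captured concrete, here is a minimal sketch using finditer, which exposes both the whole match and the group. The variable names (pattern, text, spans, whole) are only illustrative, and the last pattern is one common way, not necessarily the only one, to capture the whole repetition: the alternation is made non-capturing and an outer group is wrapped around the quantifier.

```python
import re

pattern = re.compile(r'(a|b)+')
text = 'abaca'

# group(0) is the whole match; group(1) is what the capturing group
# held after its last iteration of the + quantifier.
spans = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(spans)  # [('aba', 'a'), ('a', 'a')]

# With exactly one capturing group in the pattern, findall returns
# that group's content rather than the whole match.
print(pattern.findall(text))  # ['a', 'a']

# Illustrative alternative: a non-capturing alternation (?:a|b)
# inside an outer group that spans the whole repetition.
whole = re.findall(r'((?:a|b)+)', text)
print(whole)  # ['aba', 'a']
```

The first print shows that the engine did match aba; it is only the group that ends up holding a single character.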