Quantifiers—star, plus, and question mark
When you want to match a character or pattern more than once, or make it optional, the star, plus sign, and question mark are exactly what's needed. These characters are called quantifiers. They are also metacharacters: the plus sign, for example, does not match a literal plus sign in a string (as we will see), but instead has a special meaning.
Question mark
The question mark is used to match something zero or one time, or phrased differently, it makes a match optional.
A simple example is the regex `colou?r`, which matches both the US spelling `color` and the British spelling `colour`. The question mark after the `u` makes the `u` optional, meaning the regex will match both with the `u` present and with it absent.
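As a minimal sketch of the optional `u`, the following uses Python's `re` module. The choice of Python and the extra test words are assumptions made for illustration; the text itself is not tied to a particular regex engine. `fullmatch` is used so that the whole word has to match the pattern.

```python
import re

# `u?` makes the letter u optional: it may appear zero times or once.
pattern = re.compile(r"colou?r")

# Illustrative inputs (not from the text): two valid spellings, two invalid ones.
words = ["color", "colour", "colouur", "colr"]
question_results = {word: bool(pattern.fullmatch(word)) for word in words}

for word, matched in question_results.items():
    print(f"{word!r}: {matched}")
# 'color': True
# 'colour': True
# 'colouur': False  (two u's is more than "zero or one")
# 'colr': False     (the o before the optional u is still required)
```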
Star
The star means that something should match zero or more times. For example, this regex can be used to match the string Once upon a time:
Once .* a time
The dot matches any single character, and the star means that the match should be made zero or more times, so together they match any run of characters, and the regex ends up matching the string `Once upon a time`. (It also matches any other string beginning with "Once " and ending with " a time".)
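The behaviour of `.*` can be checked with a short sketch, again assuming Python's `re` module. The sample strings other than `Once upon a time` are made up for illustration. Note that in Python the dot does not match a newline unless the `re.DOTALL` flag is set.

```python
import re

# `.` matches any single character (except a newline by default in Python);
# `*` repeats it zero or more times.
pattern = re.compile(r"Once .* a time")

samples = [
    "Once upon a time",
    "Once in a while, for a time",
    "Once  a time",  # two spaces: `.*` matches zero characters between them
    "Once a time",   # one space only: the two literal spaces cannot both match
]
star_results = {s: bool(pattern.fullmatch(s)) for s in samples}

for s, matched in star_results.items():
    print(f"{s!r}: {matched}")
# 'Once upon a time': True
# 'Once in a while, for a time': True
# 'Once  a time': True
# 'Once a time': False
```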
Plus sign
The plus sign is similar to the star, but it means "match one or more times", so the difference is that the plus must match at least one character. For example, the regex `Once upon a time.*` matches the string `Once upon a time`, but if the regex were instead `Once upon a time.+` then that string would no longer match, as one or more additional characters would be required.
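The difference between the two quantifiers can be seen side by side in a small sketch, assuming Python's `re` module. The continuation text appended to the string is invented for the example.

```python
import re

star = re.compile(r"Once upon a time.*")  # zero or more trailing characters
plus = re.compile(r"Once upon a time.+")  # at least one trailing character

text = "Once upon a time"
longer = text + ", there was a regex"  # hypothetical continuation

star_on_text = bool(star.fullmatch(text))
plus_on_text = bool(plus.fullmatch(text))
plus_on_longer = bool(plus.fullmatch(longer))

print(star_on_text)    # True: `.*` is satisfied by zero characters
print(plus_on_text)    # False: `.+` needs at least one more character
print(plus_on_longer)  # True
```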
Grouping
If you wanted to create a regex that allows a word to appear one or more times in a row, for example to match both `A really good day` and `A really really good day`, then you would need a way to specify that the word in question (`really` in this case) can be repeated. Using the regex `A really+ good day` would not work, as the plus quantifier would apply only to the character `y`. What we want is a way to make the quantifier apply to the whole word (including the space that follows it). The solution is to use grouping to indicate to the regex engine that the plus should apply to the whole word. Grouping is achieved by using standard parentheses, just as in the alternation example above.
Knowing this, we can change the regex so that `really` is allowed more than once:
A (really )+good day
This will now match both `A really good day` and `A really really good day`. Grouping can be used with any of the quantifiers (question mark, star, and plus sign) and also with ranges, which is what we'll be learning about next.
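To compare the ungrouped and grouped versions, here is a minimal sketch assuming Python's `re` module. The sentences `A reallyyy good day` and `A good day` are extra inputs added for illustration.

```python
import re

ungrouped = re.compile(r"A really+ good day")    # `+` applies to the y only
grouped = re.compile(r"A (really )+good day")    # `+` applies to "really "

sentences = [
    "A really good day",
    "A really really good day",
    "A reallyyy good day",
    "A good day",
]
grouping_results = {
    s: (bool(ungrouped.fullmatch(s)), bool(grouped.fullmatch(s)))
    for s in sentences
}

for s, (plain, with_group) in grouping_results.items():
    print(f"{s!r}: ungrouped={plain}, grouped={with_group}")
# 'A really good day': ungrouped=True, grouped=True
# 'A really really good day': ungrouped=False, grouped=True
# 'A reallyyy good day': ungrouped=True, grouped=False
# 'A good day': ungrouped=False, grouped=False
```

The last line shows that the plus still demands at least one occurrence of the group, so the word cannot be left out entirely.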
Ranges
The star and plus quantifiers are a bit crude in that they allow an unlimited number of repetitions. What if you wanted to specify exactly how many times something should match? This is where the interval quantifier comes in handy: it allows you to specify a range that defines how many times something should match.
Suppose that you wanted to match both the string `Hungry Hippos` and the string `Hungry Hungry Hippos`. You could of course make the second `Hungry` optional by using the regex `Hungry (Hungry )?Hippos`, but with the interval quantifier the same effect can be achieved by using the regex `(Hungry ){1,2}Hippos`.
The word `Hungry` is matched either one or two times, as defined by the interval quantifier `{1,2}`. The range could easily have been something else, such as `{1,5}`, which would have made the hippos very hungry indeed, as it would match `Hungry` up to five times.
Note that the parentheses are required in this case. Using `Hungry{1,2}` without the parentheses would have been incorrect, as the quantifier would then apply only to the character `y`, matching it one or two times. The parentheses are required to group the word `Hungry` (together with the space after it) so that the `{1,2}` range is applied to the whole word.
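The interval quantifier can be explored with a short sketch, assuming Python's `re` module. The helper `accepted_counts` is a hypothetical function written for this example; it reports how many repetitions of `Hungry ` each pattern accepts in front of `Hippos`.

```python
import re

optional = re.compile(r"Hungry (Hungry )?Hippos")
interval = re.compile(r"(Hungry ){1,2}Hippos")
up_to_five = re.compile(r"(Hungry ){1,5}Hippos")
ungrouped = re.compile(r"Hungry{1,2}")  # quantifier applies to the y only


def accepted_counts(pattern, max_count=6):
    """Return the repetition counts of 'Hungry ' (0..max_count) that the pattern accepts."""
    return [
        n for n in range(max_count + 1)
        if pattern.fullmatch("Hungry " * n + "Hippos")
    ]


print(accepted_counts(optional))    # [1, 2]
print(accepted_counts(interval))    # [1, 2]
print(accepted_counts(up_to_five))  # [1, 2, 3, 4, 5]

# Without parentheses, only the final y is repeated.
ungrouped_results = {
    s: bool(ungrouped.fullmatch(s))
    for s in ["Hungry", "Hungryy", "Hungryyy", "HungryHungry"]
}
print(ungrouped_results)
# {'Hungry': True, 'Hungryy': True, 'Hungryyy': False, 'HungryHungry': False}
```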
You can also specify just a single number, like so:
(Hungry ){2}Hippos
This matches "Hungry" exactly twice, and hence will match the phrase `Hungry Hungry Hippos` and nothing else.
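A quick check of the exact-count form, sketched with Python's `re` module (an assumption, as above). Because the group `(Hungry )` already ends with a space, no additional space is written before `Hippos`; adding one would require two spaces in the input.

```python
import re

exactly_two = re.compile(r"(Hungry ){2}Hippos")
extra_space = re.compile(r"(Hungry ){2} Hippos")  # literal space after the group

# Which repetition counts (0 to 4) does the exact-count pattern accept?
exact_counts = [
    n for n in range(5)
    if exactly_two.fullmatch("Hungry " * n + "Hippos")
]
print(exact_counts)  # [2]

phrase = "Hungry Hungry Hippos"
single_space_ok = bool(extra_space.fullmatch(phrase))
double_space_ok = bool(extra_space.fullmatch("Hungry Hungry  Hippos"))
print(single_space_ok)  # False: the pattern expects two spaces before Hippos
print(double_space_ok)  # True
```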
The following table summarizes the quantifiers we have discussed so far:
| Quantifier | Meaning |
|---|---|
| `*` | Match the preceding character or sequence 0 or more times. |
| `?` | Match the preceding character or sequence 0 or 1 times. |
| `+` | Match the preceding character or sequence 1 or more times. |
| `{min,max}` | Match the preceding character or sequence at least min times and at most max times. |
| `{num}` | Match the preceding character or sequence exactly num times. |