Regular expressions are powerful, but with great power comes great responsibility. Because of the way most regex engines work, it is surprisingly easy to construct a regular expression that can take a very long time to run. In my previous post on regex performance, I discussed why and under what conditions certain regexes take forever to match their input. If my last post answered the question of why regexes are sometimes slow, this post aims to answer the question of what to do about it, as well as show how much faster certain techniques can make your regexes.
Like my previous post, this post assumes you are somewhat familiar with regexes. Check out this excellent site if you need an intro, a refresher, or clarification on some of the techniques discussed below.
Finally, keep in mind that different regex engines work in different ways and incorporate different optimizations. The following tricks will likely help the performance of your regexes. To avoid needlessly obfuscating your regexes with performance enhancements that make no real difference, I urge you to benchmark your regular expressions with a set of your expected input. Don’t forget to include matching and non-matching input if you expect to have both. Try each of the techniques and see for yourself which one offers the best performance boost.
Without further ado, here are five regular expression techniques that can dramatically reduce processing time:
- Character classes
- Possessive quantifiers (and atomic groups)
- Lazy quantifiers
- Anchors and boundaries
- Optimizing regex order
Character Classes
This is the most important thing to keep in mind when crafting performant regexes. Character classes specify which characters you are trying, or not trying, to match. The more specific you can be here, the better. You should almost always aim to replace the `.` in your `.*`s with something more specific. A `.*` will invariably shoot to the end of your line (or even your whole input if you have dot-all enabled) and will then backtrack. When using a specific character class, you have control over how many characters the `*` will cause the regex engine to consume, giving you the power to stop the rampant backtracking.
To demonstrate this, let’s consider the two regular expressions:
1. `field1=(.*) field2=(.*) field3=(.*) field4=(.*).*`
2. `field1=([^ ]*) field2=([^ ]*) field3=([^ ]*) field4=([^ ]*).*`
I ran a (quick and dirty) benchmark against the following inputs:
1. `field1=cat field2=dog field3=parrot field4=mouse field5=hamster`
2. `field1=cat dog parrot mouse`
3. `field1=cat field2=dog field3=parrot field5=mouse`
This benchmark and all the other benchmarks in this post were conducted in the same way: each regex was fed each input 1,000,000 times, and the average overall time was measured. These are the numbers I got for this particular experiment:
| | Regex 1 (the `.*` one) | Regex 2 (the character class one) | Performance improvement |
| --- | --- | --- | --- |
| Input 1 (matching) | 3606ms | 736ms | 79.6% |
| Input 2 (not matching) | 591ms | 225ms | 61.9% |
| Input 3 (almost matching) | 2520ms | 597ms | 76.3% |
Here we can see that even with matching input, the vague `.*` regex takes far longer. In all cases, the specific regex performed much better. This will almost always be the case, no matter what your regex is and no matter what your input is. Specificity is the number one way to improve the performance of your regexes. Just say that over and over again. Like a mantra.
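If you want to reproduce this kind of experiment, here is a minimal sketch of a quick-and-dirty harness in Java (the flavor whose quantifier syntax this post uses throughout); the class structure and method names are illustrative, not the exact code I ran:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexBench {
    // The two regexes from the experiment above.
    static final Pattern VAGUE =
        Pattern.compile("field1=(.*) field2=(.*) field3=(.*) field4=(.*).*");
    static final Pattern SPECIFIC =
        Pattern.compile("field1=([^ ]*) field2=([^ ]*) field3=([^ ]*) field4=([^ ]*).*");

    public static void main(String[] args) {
        String input = "field1=cat field2=dog field3=parrot field4=mouse field5=hamster";
        System.out.println("vague:    " + time(VAGUE, input) + "ms");
        System.out.println("specific: " + time(SPECIFIC, input) + "ms");
    }

    // Feed the regex the same input 1,000,000 times and report wall-clock time.
    // Quick and dirty: a serious benchmark would also account for JVM warmup.
    static long time(Pattern p, String input) {
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            Matcher m = p.matcher(input);
            m.matches(); // we only care how long matching takes, not the result
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```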
Possessive Quantifiers (and Atomic Groups)
Possessive quantifiers (denoted by a + appended to another quantifier, as in `.*+`) and atomic groups (`(?>…)`) both do the same thing: once they consume text, they never give it back. This can be nice for performance reasons because it cuts down on the backtracking that regexes are wont to do so much of. Generally speaking, though, you may be hard pressed to find a use case where atomic groups are a real game changer in terms of performance. This is because the main performance heavy hitter is the infamous `.*`, which causes lots of backtracking. If you change the `.*` to a possessive `.*+`, you eliminate all backtracking, but you also can't match anything else after that point, since the possessive quantifier never gives back any text. Thus, your regex already has to be fairly specific in order to even use atomic groups; therefore, your performance boost will be incremental. Nonetheless, the possessive quantifier can still be surprisingly helpful. Consider these two regexes to match an IPv4 address:
1. `^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*`
2. `^(\d{1,3}+\.\d{1,3}+\.\d{1,3}+\.\d{1,3}+).*`
on the following two inputs:
1. `107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /" 200 2144`
2. `9.21.2015 non matching text that kind of matches`
When matching the non-matching text, the regex without the possessive quantifier consumes the first few characters and, on not seeing a match, it backtracks all the characters one by one hoping to still find a match. With the possessive quantifier, as soon as the regex doesn’t find a match, it stops looking and doesn’t bother backtracking.
When running my benchmark on these regexes, the possessive version came out ahead on the non-matching input. How much of a boost in performance you'll get from this is use-case specific, but if you can use the possessive quantifier (or an atomic group), then you should, as it can pretty much only help.
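To make the syntax concrete, here is a small sketch, assuming Java's `java.util.regex` (which supports both possessive quantifiers and atomic groups); the input is the non-matching line from above:

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    public static void main(String[] args) {
        // Backtracking version: on failure, each \d{1,3} gives characters back one by one.
        Pattern backtracking =
            Pattern.compile("^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*");
        // Possessive version: once \d{1,3}+ consumes digits, it never returns them.
        Pattern possessive =
            Pattern.compile("^(\\d{1,3}+\\.\\d{1,3}+\\.\\d{1,3}+\\.\\d{1,3}+).*");
        // Atomic-group version: behaves the same as the possessive one.
        Pattern atomic =
            Pattern.compile("^((?>\\d{1,3})\\.(?>\\d{1,3})\\.(?>\\d{1,3})\\.(?>\\d{1,3})).*");

        String nonMatching = "9.21.2015 non matching text that kind of matches";

        // All three print false; the possessive and atomic versions just fail faster
        // because they refuse to backtrack into the digit runs they already consumed.
        System.out.println(backtracking.matcher(nonMatching).matches());
        System.out.println(possessive.matcher(nonMatching).matches());
        System.out.println(atomic.matcher(nonMatching).matches());
    }
}
```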
Lazy Quantifiers
The lazy quantifier is a powerful performance booster. In many naive regexes, greedy quantifiers (`*`) can be safely replaced by lazy quantifiers (`*?`), giving the regex a performance kick without changing the result.
Consider the following example. When given the input
`# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_examined: 1 Rows_affected: 0 Rows_read: 4505295`
and the greedy regex:
`.* Lock_time: (\d\.\d+) .*`
the regex engine would first shoot to the end of the string. It then backtracks until it gets to Lock_time, where it can consume the rest of the input. The alternative lazy regex
`.*? Lock_time: (\d\.\d+) .*`
would consume starting from the beginning of the string until it reaches Lock_time, at which point it could proceed to match the rest of the string. If the Lock_time field appears toward the beginning of the string, the lazy quantifier should be used. If the Lock_time field appears toward the end, it might be appropriate to use the greedy quantifier.
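Here is a minimal sketch of the two variants side by side, again assuming Java regex syntax; both capture the same value and differ only in how much backtracking they do:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyDemo {
    public static void main(String[] args) {
        String line = "# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 "
                    + "Rows_examined: 1 Rows_affected: 0 Rows_read: 4505295";

        // Greedy: .* first shoots to the end of the line, then backtracks to " Lock_time: ".
        Matcher greedy = Pattern.compile(".* Lock_time: (\\d\\.\\d+) .*").matcher(line);
        // Lazy: .*? grows one character at a time until " Lock_time: " can match.
        Matcher lazy   = Pattern.compile(".*? Lock_time: (\\d\\.\\d+) .*").matcher(line);

        if (greedy.matches()) System.out.println("greedy captured: " + greedy.group(1));
        if (lazy.matches())   System.out.println("lazy captured:   " + lazy.group(1));
        // Both print 0.81; only the amount of backtracking differs.
    }
}
```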
Some regex performance guides will advise you to be wary when using the lazy quantifier because it does its own kind of backtracking: it consumes one character at a time and then attempts to match the rest of the regex; if that fails, it "backtracks," moves the cursor one character over, and repeats. This can sometimes make the lazy star no faster, or even slower, than the greedy star. I saw this slight performance degradation in only one of my benchmarks.
I ran my benchmark on the following three inputs. The first input matches toward the beginning, the second input matches toward the end, and the third input doesn’t match at all.
1. `# Query_time: 0.304 Lock_time: 0.81 Rows_sent: 1 Rows_read: 4505295 Rows_affected: 0 Rows_examined: 1`
2. `# Query_time: 0.304 Rows_sent: 1 Rows_read: 4505295 Rows_affected: 0 Lock_time: 0.81 Rows_examined: 1`
3. `# Query_time: 0.304 Rows_sent: 1 Query_time: 0.304 Rows_sent: 1 Query_time: 0.304 Rows_sent: 1 Rows_examined: 1`
I matched against the two regexes mentioned above:
1. `.*Lock_time: (\d\.\d+).*`
2. `.*?Lock_time: (\d\.\d+).*`
The performance characteristics change when you add more `.*`s. Consider these regexes, which match two fields:
3. `.*Lock_time: (\d\.\d+).*Rows_examined: (\d+).*`
4. `.*?Lock_time: (\d\.\d+).*?Rows_examined: (\d+).*`
I ran the benchmark against the same inputs. Given the results, I'd say it's generally a good idea to use the lazy quantifier wherever possible, but it is still important to benchmark just to be sure, as different regex engines optimize in different ways.
Anchors and Boundaries
Anchors and boundaries tell the regex engine that you intend the cursor to be in a particular place in the string. The most common anchors are `^` and `$`, indicating the beginning and end of the line (as opposed to `\A` and `\Z`, which match the beginning and end of the input). Common boundaries include the word boundary `\b` and non-word boundary `\B`. For example, `http\b` matches http but not https. These techniques are useful when crafting regexes that are as specific as possible.
The example below is pretty simple, but it should serve as a reminder to use anchors whenever possible, considering the impact they can have on performance.
Here are two regexes to find an IPv4 address.
1. `\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`
2. `^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`
The second regex is specific about the IP address appearing at the beginning of the string.
We’re searching for the regex in input that looks like this:
`107.21.20.1 - - [07/Dec/2012:18:55:53 -0500] "GET /extension/bsupport/design/cl/images/btn_letschat.png HTTP/1.1" 200 2144`
Non-matching input would look something like this:
`[07/Dec/2012:23:57:13 +0000] 1354924633 GET "/favicon.ico" "" HTTP/1.1 200 82726 "-" "ELB-HealthChecker/1.0"`
In my benchmark, Regex 2 of course ran much faster on the non-matching input because it throws that input out almost immediately. In short, if you can use an anchor or a boundary, then you should, because they can pretty much only help the performance of your regex.
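As a sketch of why the anchored version fails fast (assuming Java's `java.util.regex`; the class name is illustrative):

```java
import java.util.regex.Pattern;

public class AnchorDemo {
    public static void main(String[] args) {
        // Unanchored: the engine retries the match at every position in the string.
        Pattern unanchored = Pattern.compile("\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");
        // Anchored: the match can only start at the beginning of the input.
        Pattern anchored   = Pattern.compile("^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");

        String nonMatching = "[07/Dec/2012:23:57:13 +0000] 1354924633 GET \"/favicon.ico\" "
                           + "\"\" HTTP/1.1 200 82726 \"-\" \"ELB-HealthChecker/1.0\"";

        // The unanchored pattern scans every starting position before giving up;
        // the anchored one is rejected as soon as '[' fails to match \d.
        System.out.println(unanchored.matcher(nonMatching).find()); // false, slowly
        System.out.println(anchored.matcher(nonMatching).find());   // false, almost instantly
    }
}
```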
Order Matters
Here I am talking about the ordering of alternations—when a regex has two or more valid options separated by a | character. Order will also matter if you have multiple lookaheads or lookbehinds. The idea is to order each option in the way that will minimize the amount of work the regex engine will need to do. For alternations, you want the most common option to be first, followed by the rarer options. If the rarer options are first, the regex engine will waste time checking those before checking the more common options which are likelier to succeed. For multiple lookaheads and lookbehinds, you want the rarest to be first, since all lookaheads and lookbehinds must match for the regex to proceed. If you start with the one that is least likely to match, the regex will fail faster.
This one is a bit of a micro-optimization, but it can give you a decent boost depending on your use case, and it can’t possibly hurt because the two expressions are equivalent. I ran a benchmark on the following two regexes:
1. `.*(?<='field5' : '|"field5" : ")([^'"]*).*`
2. `.*(?<="field5" : "|'field5' : ')([^"']*).*`
On the following input:
{"field1" : "wool", "field2" : “silk", "field3" : "linen", "field4" : "merino", "field5" : "alpaca"}"
I'm searching for the JSON field "field5", and I check whether the JSON is formatted with double quotes or single quotes. Since double quotes are far more common in JSON, the option that checks double quotes should come first. My benchmark confirmed a decent performance difference between the two orderings.
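Here is a small sketch of the two orderings, assuming Java's `java.util.regex`; the JSON string is the input from above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OrderDemo {
    public static void main(String[] args) {
        String json = "{\"field1\" : \"wool\", \"field2\" : \"silk\", \"field3\" : \"linen\", "
                    + "\"field4\" : \"merino\", \"field5\" : \"alpaca\"}";

        // Rare branch (single quotes) first: the engine tries and rejects the
        // single-quote lookbehind at each attempted position before the common
        // double-quote branch gets a chance.
        Pattern rareFirst   = Pattern.compile(".*(?<='field5' : '|\"field5\" : \")([^'\"]*).*");
        // Common branch (double quotes) first: the usual case succeeds right away.
        Pattern commonFirst = Pattern.compile(".*(?<=\"field5\" : \"|'field5' : ')([^\"']*).*");

        Matcher m = commonFirst.matcher(json);
        if (m.matches()) System.out.println(m.group(1)); // prints: alpaca
    }
}
```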
Concluding Thoughts
Regex performance is an interesting topic. For most people, regexes are whipped out only in special circumstances to solve a very specific type of problem. Normally, it doesn’t matter if a regex is a bit slower than it could be. Many people who develop very latency-sensitive applications avoid regexes as they are notoriously slow. If a regex is really the only tool to get the job done but it must be blazingly fast, your options are to use a regex engine that is backed by the Thompson NFA algorithm (and, consequently, to say goodbye to back references) or to live and breathe the time-saving regex techniques in this post. Lastly, as is always the case when optimizing performance, benchmarking is key. Regex performance depends heavily on the input and the regex. Your benchmark should use the same regex engine and should measure against input that is similar to what you expect to match in your production application.
I hope that these posts have made you wiser and that your regexes are now much defter. You are blessed now with the knowledge of what makes a good regex and what makes a bad regex. Equipped with these new instruments and knowledge, you are ready to craft your own powerful, yet efficient regular expressions. Happy regexing!