Nell Shamrell works as a Software Development Engineer for Blue Box. She also sits on the advisory board for the University of Washington Certificate in Ruby Programming. She specializes in Ruby, Rails, and Test Driven Development. Prior to entering the world of software development, she studied and worked in the field of Theatre. The world of Theatre prepared her well for the dynamic world of creating software applications. In both, she strives to create a cohesive and extraordinary experience. In her free time she enjoys practicing the martial art Naginata.
Regular expressions are powerful. As a famous superhero once said, with great power comes great responsibility. To keep a regular expression from causing havoc, you need to know how to control its behavior.
Regular expressions have three distinct, recognizable behaviors: greedy, lazy, and possessive. These words sound pretty negative but they’re not necessarily bad ways for your regular expression to behave. These are descriptions of different types of behavior your regular expression might have. Behavior you can recognize and control! I’m going to show you how.
To understand these regular expression attitudes, we need to understand quantifiers. Quantifiers simply tell your regular expressions engines how many times a character or group of characters should appear in your string. One of the quantifiers I use the most is the "+" quantifier. When I add this to a character, it means that character needs to appear at least one time. It can appear as many times as it wants, but it needs to be there at least once.
So this regular expression
would match any character appearing at least once. It guarantees a character will be there.
Quantifiers lie at the root of whether your regular expression is greedy, lazy, or possessive. By default, they’re greedy.
A greedy quantifier tries to match as much of the string as it can. It grabs as much of the string as it can get its greedy little hands on and tries to make a match. If the whole string doesn’t work, it backs up one character and tries again. It repeats this process until there are no more characters for it to test.
Greedy quantifiers use maximum effort for maximum return. A greedy quantifier will try as many ways as it can to find a match and will return the maximum characters that could possibly be a part of that match.
Let’s look at an example. I’m going to switch science fiction fandoms for this next string and use a quote from Star Trek: First Contact:
string = "There’s no time to talk about time we don’t have the time" /.+time/
This regular expression matches any character appearing at least once followed by the word "time." Now if I run the match method on this regular expression, passing in the string:
/.+time/.match(string) => "There’s no time to talk about time we don’t have the time"
It matches our entire string.
When this regular expression sees the string, it tries to match the first part of the regular expression, the ".+," first. This matches the entire string. Then it tries to match the second part of our regular expression, the word "time." Because it already has the entire string marked as a match, it’s going to first look for the word "time" beyond the end of the string. It’s not going to find it since there’s nothing there, so it backtracks. It moves back one character at a time until it finds a match. When it finds it, it returns the whole match. In this case, it’s our entire string.
Greedy quantifiers try to match the whole string, then backtrack. Backtracking means if the entire string doesn’t match the entire regular expression, it will try as many ways as possible to find a match. It needs to keep track of what ways it’s tried so it doesn’t repeat them. This can potentially take up a lot of system resources, particularly when you have multiple matches running on large amounts of text.
Oniguruma has optimizations that make backtracking quicker. Patrick Shaughnessy has a fantastic blog post that goes into the details of how Oniguruma handles backtracking. Even with optimizations, however, a greedy regular expression will chew through a lot of resources.
When you want a more contained match that uses much less resources, you want a lazy quantifier. Also known as a reluctant quantifier, it starts at the very beginning of the string and tries to make a match with the very first character. If it doesn’t find a match, it grabs another character. As an absolute last resort, it will grab the whole string to try and find a match.
A lazy quantifier uses minimum effort for minimum return. It returns as few characters as possible to make match. If it finds a match in the first character of the string, it will return just the first character. It’s lazy. It does just enough to get by, nothing more.
You make a quantifier lazy simply by adding a question mark after it.
If I run the match method on my string using this lazy regular expression
/.+?time/.match(string) => "There’s no time"
I only get "There’s no time" back. It started at the very beginning of the string and delivered just enough to be a match. Lazy regular expressions use much less backtracking and, therefore, fewer resources than greedy regular expressions.
What if you do want to match as many characters as possible, but don’t want backtracking to consume your resources? There’s a third kind of quantifier, possessive quantifiers. These are all or nothing. Either there’s a match on the first try or they fail. Like a greedy quantifier, they grab as much of the string as they can - the entire string- and try to make a match. If that match fails, though, they won’t backtrack or try again.
Possessive quantifiers use minimum effort for maximum return. They try to return as many characters as possible for the bare minimum effort - they give it one go then give up.
To make a quantifier possessive, you add a plus sign to it:
Let’s run match on the string using this possessive regular expression
/.++time/.match(string) => nil
It returns nil. The match fails. Why would it fail? It seems like our entire string should match this regular expression. The reason this fails is because there is no backtracking.
The first thing our regular expression tries to match is the ".+." This matches the entire string. When it tries to match the second part of our regular expression, "time", it already has the entire string marked as a match for ".+." It looks for the word "time" AFTER our entire string. This is the same thing a greedy quantifier does, but a greedy quantifier can go back earlier in the string and look for a match. A possessive quantifier can’t go back in the string to look for a match because it can’t backtrack. Therefore, it fails.
The main advantage Possessive quantifiers offer is they fail fast. They don’t backtrack, so they use minimal resources. A greedy quantifier will try every possible way to try to make a match. If it fails, all that work, all those resources, will be for nothing. A possessive quantifier prevents this. If it’s going to fail, it fails quickly.
Generally, you only want to use possessive quantifiers for very small regular expressions, usually when you have small sub expressions nested within larger expressions. They’re very useful, but use with caution.
Regular expressions can seem overwhelmingly complex. When I was learning to go beyond the basics of regular expressions, beyond the few tricks I knew for things like email validation, I found it helpful to think of it as writing a subprogram in another language. In fact, that’s exactly what it is. You are writing another program, in another language, within Ruby itself.
Like any programming language, it really helps to write your regular expressions in small chunks. If I’m writing a lookbehind, I first want to write the main pattern I will be matching. I’ll make sure this works. Then I will write a pattern for the lookbehind and make sure it works on it’s own. Only after I’ve confirmed that both patterns work will I put them together and test.
Rubular is a fantastic site for both composing and testing your regular expressions. Check it out. Use it. It’s made my life so much easier.
Like any software program, regular expressions come in drafts. When you’re developing a regular expression, it’s okay for it to be ugly at first. Get it working and, once you have it working, refactor. It’s the same red, green, refactor process we use in test driven development.
Regular expressions are powerful. So powerful they inspire fear in many of us. That fear can be overcome. As cryptic as they might seem, they do have a logical reasoning and structure. Use them. Fire up Rubular and try some lookaheads and lookbehinds, experiment with greedy, lazy, and possessive quantifiers. Explore the fantastic ways Ruby works with regular expressions. I think you’ll be amazed at what you find.
Have you read the news?
Rumor has it this cloud thing is taking off.