
boost::regex: greedy and non-greedy

A good introduction to some of the most interesting Boost libraries is "Beyond the C++ Standard Library: An Introduction to Boost" by Björn Karlsson, published by Addison-Wesley Professional. I'm reading it at the moment, and these are a few notes I'm jotting down along the way.

The regex (.*)(\d{2}) means something like: capture any sequence of characters (first subexpression), then two decimal digits (second subexpression). The subtle point is that the first subexpression is greedy: it swallows as many characters as it can, including any earlier pairs of digits, and leaves only the last pair in the string for the second subexpression.

Sometimes this is not what we want. Luckily, we can make a quantifier non-greedy by putting a question mark right after it: (.*?) matches as few characters as possible, so the second subexpression gets the first pair of digits instead of the last.
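To see the effect of the trailing question mark in isolation, here is a minimal sketch on a simpler pattern. Note the assumptions: it uses std::regex from the C++11 standard library instead of boost::regex (for these patterns the two should behave the same way), and first_match and first_match_demo are names made up for this note, not part of either library.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or an empty string when nothing matches.
std::string first_match(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re))
        return m[0].str();
    return std::string();
}

// Both calls search the same text; only the quantifier differs.
void first_match_demo()
{
    // Greedy: \d+ takes every digit it can.
    assert(first_match("id 12345", "\\d+") == "12345");
    // Non-greedy: \d+? stops as soon as the pattern is satisfied.
    assert(first_match("id 12345", "\\d+?") == "1");
}
```

The question mark changes how much a quantifier consumes, not where the search starts: in both cases the match begins at the first digit.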

In this example we see the difference in behaviour in action:

#include <iostream>
#include <string>
#include "boost/regex.hpp"

using std::cout;
using std::endl;
using std::string;
using boost::regex;
using boost::smatch;
using boost::regex_search;

void r05()
{
    smatch m;
    string text = "Note that I'm 31 years old, not 32.";

    cout << text << endl;

    // Greedy: (.*) takes as much as it can, so (\d{2}) gets the last pair of digits.
    cout << "Greedy expression" << endl;
    regex reg("(.*)(\\d{2})");
    if(regex_search(text, m, reg))
    {
        if(m[1].matched)
            cout << "(.*) matches: " << m[1] << endl;
        if(m[2].matched)
            cout << "(\\d{2}) matches: " << m[2] << endl;
    }

    // Non-greedy: (.*?) takes as little as it can, so (\d{2}) gets the first pair.
    cout << "Non-greedy expression" << endl;
    reg = "(.*?)(\\d{2})";
    if(regex_search(text, m, reg))
    {
        if(m[1].matched)
            cout << "(.*?) matches: " << m[1] << endl;
        if(m[2].matched)
            cout << "(\\d{2}) matches: " << m[2] << endl;
    }
}

int main()
{
    r05();
}
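For the sentence used above, the captures can also be checked programmatically. The following is only a sketch under stated assumptions: it swaps in std::regex (C++11) for boost::regex so that it needs no external library, and capture_pair and capture_pair_demo are hypothetical names introduced here.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: searches `text` with a two-group `pattern` and
// returns the text captured by group 1 and group 2.
// Returns two empty strings when there is no match.
std::pair<std::string, std::string>
capture_pair(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re) && m.size() >= 3)
        return std::make_pair(m[1].str(), m[2].str());
    return std::make_pair(std::string(), std::string());
}

void capture_pair_demo()
{
    const std::string text = "Note that I'm 31 years old, not 32.";

    // Greedy: group 1 runs up to the last pair of digits.
    const auto greedy = capture_pair(text, "(.*)(\\d{2})");
    assert(greedy.first == "Note that I'm 31 years old, not ");
    assert(greedy.second == "32");

    // Non-greedy: group 1 stops before the first pair of digits.
    const auto lazy = capture_pair(text, "(.*?)(\\d{2})");
    assert(lazy.first == "Note that I'm ");
    assert(lazy.second == "31");
}
```

So the greedy version reports 32 and the non-greedy one reports 31, and in both cases the first group keeps the space that precedes the digits.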
