Google Releasing New Open Source Regex Library

As this post on the Google open source blog, regular expressions are pretty fundamental to much advanced hackery. Most implementations, however, share a fairly horrific worst case in terms of time and space complexity. That is to say, under certain circumstances, evaluating a regular expression over a large enough amount of input could hang forever, gobble up all memory allocated to its thread or threads, or both.

Google has been gobbling up the brightest computer scientists and software engineers at a prodigious pace. RE2, their new regular expressions library, is the latest dividend of their carefully hoarded brain trust. The search giant claims their improved implementation completes in linear time and won’t overflow the stack space of its threads. Put simply, they’ve fixed the potential worst case for this critical Swiss Army knife software tool.

Like Go, this is another stab at speeding up the fundamentals most of us building software take for granted. Beyond the worst case, I would imagine such a startlingly different approach may also speed up the more typical cases.

The release is available under a very permissive BSD license. It won’t take long for the wider community to vet their claims. If RE2 lives up to its promise, I imagine it won’t take long to spread far and wide.

Update: See Philip’s comment below.  The predictable performance of RE2 comes at a considerable price: it doesn’t support back references.  Back references let you re-use part of an expression, the value a group actually matched, later in the expression.  They are intensely useful and based on my own experience, I would hesitate quite a bit more at the trade off Russ Cox suggests is so simple.

5 Replies to “Google Releasing New Open Source Regex Library”

  1. “RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently” according to http://code.google.com/p/re2 and backreferences are all grayed out as not supported by RE2 on http://code.google.com/p/re2/wiki/Syntax . Will RE2 spread far and wide? Well. . . do you want backreferences?

    In his paper Regular Expression Matching Can Be Simple And Fast, Russ Cox, the Google engineer who also authored the press release, says, “Some might argue that this test is unfair to the backtracking implementations, since it focuses on an uncommon corner case. This argument misses the point: given a choice between an implementation with a predictable, consistent, fast running time on all inputs or one that usually runs quickly but can take years of CPU time (or more) on some inputs, the decision should be easy.”

    I guess so. . . it really comes down to features versus performance.

    1. Had I not been so tired when writing up this post, the breathless tone would have raised a red flag. Still, if you have some application where you don’t need back references but need greater speed/scalability, I suppose it is nice to have this option.
      As it stands, I retract the implication that this is a suitable replacement for even a majority of the uses to which existing implementations have been put.

  2. Interestingly, I see the article.
    The sites about RE2:
    http://swtch.com/~rsc/regexp/regexp1.html etc

    I read contents and test some, but there are questions.

    See entire benchmark graph, almost RE2 is faster than PCRE.
    Is there result just confirm only with matching?
    Is there result about parsing data for other use?

    I do some test PCRE and RE2 samples.

    ============PCRE============
    rc = pcre_exec(
    regex, /* compiled regular expression */
    NULL, /* optional results from pcre_study */
    text, /* input string */
    (int)strlen(text), /* length of input string */
    0, /* starting position in input string */
    0, /* OR’d options */
    capturevector, /* holds results of capture groups */
    CAPTUREVECTORSIZE);
    =============================

    ============RE2============
    RE2::FullMatch(str, pat);
    =============================

    If test above, the result is same graph the website.

    But I want parsing data, so I make following code to parsing data.
    And I measure time how long these take.
    The result is PCRE is faster than RE2

    ============PCRE============
    rc = pcre_exec(
    regex, /* compiled regular expression */
    NULL, /* optional results from pcre_study */
    text, /* input string */
    (int)strlen(text), /* length of input string */
    0, /* starting position in input string */
    0, /* OR’d options */
    capturevector, /* holds results of capture groups */
    CAPTUREVECTORSIZE);

    for (j = 1; j < rc; j++)
    {
    char *substring_start = text + capturevector[2*j];
    int substring_length = capturevector[2*j+1] – capturevector[2*j];
    }
    =============================
    ============RE2============
    RE2::FullMatch(str, pat2,
    &s[0], &s[1], &s[2], &s[3], &s[4],
    &s[5], &s[6], &s[7], &s[8], &s[9],
    &s[10], &s[11], &s[12]);
    =============================

    Is there some wrong code for testing?

    Or website's benchmark data is only matching use?

    Best Regards

Leave a Reply

Your email address will not be published. Required fields are marked *