Boost Regex Back Reference Issue

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Boost Regex Back Reference Issue

Boost - Users mailing list
According to the Boost documentation for Back references
(http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.back_references) which seems to be the same at least from 1.37 - 1.64.

Quoting a piece of the documentation:

-----

For example the expression:

   ^(a*).*\1$

Will match the string:

   aaabbaaa

But not the string:

   aaabba

-----


I'm finding two issues with the example cited.


1.  It seems to me that the example of the string which should not
match, should actually match.  Ultimately, shouldn't the engine match
the marked sub-expression with the first 'a' in order to satisfy the
backreference?
I tested this example with Oniguruma and with PHP's PCRE and they both
matched the string noted here to not match.

But also, since the marked sub-expression is (a*) then I wonder what the
behavior would be if it couldn't make a match on 'a', since the '*' will
allow for zero matches.  In fact, it seems like everything in the
pattern is effectively "optional" due to the '*' operator.

I'm a novice with Perl, but unless I made a mistake, it will match
unconditionally:

print "It matches\n" if "aaabba" =~ /^(a*).*\1$/;
It matches
print "It matches\n" if "aaabbac" =~ /^(a*).*\1$/;
It matches
print "It matches\n" if "" =~ /^(a*).*\1$/;
It matches
print "It matches\n" if "x" =~ /^(a*).*\1$/;
It matches
print "It matches\n" if "xyz" =~ /^(a*).*\1$/;
It matches
print "It matches\n" if "123" =~ /^(a*).*\1$/;
It matches


2.  The other issue is that when I try this example with the string
which is posted to not match, the Boost regex engine runs for a while
and ultimately crashes with a memory error.  (seems like it might be an
endless loop of some sort).  Is that a bug?


Nick


_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Boost Regex Back Reference Issue

Boost - Users mailing list
Nick wrote:

> According to the Boost documentation for Back references [...]

I use regex's frequently in Perl, C++ and other languages. I use Boost Regex exclusively in C++, even though I probably should be using std::regex instead. I'm much happier using Boost Regex and std::string than I was with PCRE and char*.

I only read the Boost regex docs for the API, not for regex syntax.

> For example the expression:
>
>    ^(a*).*\1$
>
> Will match the string:
>
>    aaabbaaa
>
> But not the string:
>
>    aaabba

This example doesn't make sense to me, either, for same reasons that you mentioned: Both strings begin and end with "a", and a* can match 0 or more characters.

I tested this example on Perl 5.22, adding the string "unmatchable" to the test input:

    $ perl -nle 'print "$_ ($1)" if /^(a*).*\1$/' <<HERE
    > aaabbaaa
    > aaabba
    > unmatchable
    > HERE

The output shows both the input string and the submatch captured by $1 and indicates that the pattern will match any input string:

    aaabbaaa (aaa)
    aaabba (a)
    unmatchable ()

I haven't tested this with Boost Regex or any other C++ library, but I believe that this example should be edited or replaced. I would suggest changing (a*) to (a+) and "aaabba" to "aaabbb". Alternatively, you could change (a*) to (a{2}) for the original input strings, but (a+) makes simpler example.

> 2.  The other issue is that when I try this example with the string which is posted to not match, the Boost
> regex engine runs for a while and ultimately crashes with a memory error.  (seems like it might be an endless
> loop of some sort).  Is that a bug?

Ouch! I'm glad nothing like that has happened to me.

It's generally a bad idea to use * too frequently in a regex, especially .* . Adding a back reference to a* only aggravates the issue. I had an experience years ago where too many .*'s killed the performance of a Perl script, and probably consumed way too much memory.

|+|  M a r k  |+|

Mark Stallard
Engineering & Operations Application Development
Business Application Services
Global Business Services Information Technology

Raytheon Company
Billerica, MA (US)

This message contains information that may be confidential and privileged. Unless you are the addressee (or authorized to receive mail for the addressee), you should not use, copy or disclose to anyone this message or any information contained in this message. If you have received this message in error, please so advise the sender by reply e-mail and delete this message. Thank you for your cooperation.
 


-----Original Message-----
From: Boost-users [mailto:[hidden email]] On Behalf Of Nick via Boost-users
Sent: Thursday, July 13, 2017 10:04 AM
To: [hidden email]
Cc: Nick <[hidden email]>
Subject: [Boost-users] Boost Regex Back Reference Issue

According to the Boost documentation for Back references
(http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.back_references) which seems to be the same at least from 1.37 - 1.64.

Quoting a piece of the documentation:

-----

For example the expression:

   ^(a*).*\1$

Will match the string:

   aaabbaaa

But not the string:

   aaabba

-----


I'm finding two issues with the example cited.


1.  It seems to me that the example of the string which should not match, should actually match.  Ultimately, shouldn't the engine match the marked sub-expression with the first 'a' in order to satisfy the backreference?
I tested this example with Oniguruma and with PHP's PCRE and they both matched the string noted here to not match.

But also, since the marked sub-expression is (a*) then I wonder what the behavior would be if it couldn't make a match on 'a', since the '*' will allow for zero matches.  In fact, it seems like everything in the pattern is effectively "optional" due to the '*' operator.

I'm a novice with Perl, but unless I made a mistake, it will match
unconditionally:

print "It matches\n" if "aaabba" =~ /^(a*).*\1$/; It matches print "It matches\n" if "aaabbac" =~ /^(a*).*\1$/; It matches print "It matches\n" if "" =~ /^(a*).*\1$/; It matches print "It matches\n" if "x" =~ /^(a*).*\1$/; It matches print "It matches\n" if "xyz" =~ /^(a*).*\1$/; It matches print "It matches\n" if "123" =~ /^(a*).*\1$/; It matches


2.  The other issue is that when I try this example with the string which is posted to not match, the Boost regex engine runs for a while and ultimately crashes with a memory error.  (seems like it might be an endless loop of some sort).  Is that a bug?


Nick


_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Boost Regex Back Reference Issue

Boost - Users mailing list
In reply to this post by Boost - Users mailing list


On 13/07/2017 15:03, Nick via Boost-users wrote:

> According to the Boost documentation for Back references
> (http://www.boost.org/doc/libs/1_64_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html#boost_regex.syntax.perl_syntax.back_references) which seems to be the same at least from 1.37 - 1.64.
>
> Quoting a piece of the documentation:
>
> -----
>
> For example the expression:
>
>     ^(a*).*\1$
>
> Will match the string:
>
>     aaabbaaa
>
> But not the string:
>
>     aaabba
>
> -----
>
>
> I'm finding two issues with the example cited.
>
>
> 1.  It seems to me that the example of the string which should not
> match, should actually match.  Ultimately, shouldn't the engine match
> the marked sub-expression with the first 'a' in order to satisfy the
> backreference?
> I tested this example with Oniguruma and with PHP's PCRE and they both
> matched the string noted here to not match.

Congratulations, you've just spotted a... ahem... 20 year old bug in the
documentation!
>
> 2.  The other issue is that when I try this example with the string
> which is posted to not match, the Boost regex engine runs for a while
> and ultimately crashes with a memory error.  (seems like it might be an
> endless loop of some sort).  Is that a bug?

Works just fine for me here, if you have a complete self-contained test
case (code, plus compiler version, platform and boost version please)
I'll look into it.

Best, John.

---
This email has been checked for viruses by AVG.
http://www.avg.com

_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Boost Regex Back Reference Issue

Boost - Users mailing list

On Thu, 2017-07-13 at 19:20 +0100, John Maddock via Boost-users wrote:

> Congratulations, you've just spotted a... ahem... 20 year old bug in the
> documentation!

Should I do something explicit for it to get fixed (bug ticket, etc), or
has that already been set in motion from this e-mail?


> > 2.  The other issue is that when I try this example with the string
> > which is posted to not match, the Boost regex engine runs for a while
> > and ultimately crashes with a memory error.  (seems like it might be an
> > endless loop of some sort).  Is that a bug?
>
> Works just fine for me here, if you have a complete self-contained test
> case (code, plus compiler version, platform and boost version please)
> I'll look into it.

If it's working for you I'll take a closer look and try to get a small
repro.  This is happening with GCC 4.9 on Linux (Ubuntu 14 & Gentoo)
with Boost 1.61.


Nick


_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Boost Regex Back Reference Issue

Boost - Users mailing list
Apologies for replying to my own email.

On Fri, 2017-07-14 at 10:55 -0400, Nick wrote:

> On Thu, 2017-07-13 at 19:20 +0100, John Maddock via Boost-users wrote:
> > > 2.  The other issue is that when I try this example with the string
> > > which is posted to not match, the Boost regex engine runs for a while
> > > and ultimately crashes with a memory error.  (seems like it might be an
> > > endless loop of some sort).  Is that a bug?
> >
> > Works just fine for me here, if you have a complete self-contained test
> > case (code, plus compiler version, platform and boost version please)
> > I'll look into it.
>
> If it's working for you I'll take a closer look and try to get a small
> repro.  This is happening with GCC 4.9 on Linux (Ubuntu 14 & Gentoo)
> with Boost 1.61.

So I found the cause.  It seems to be caused by the regex pattern and my
code's design.  My pseudo code:

    while boost::regex_search {
        // Store match(es)

        // Move start iterator to end of matches[0]
    }

Well, given the regex:

    ^(a*).*\1$

we've already agreed that this regex is effectively all optional which
means it will match anything, including ^$ (empty string), and I've
confirmed this behavior with Perl.

So the first iteration matches as we'd expect:
    [0] = aaabba, [1] = a

Then all subsequent iterations match on the empty string because the
starting iterator was moved to the end of the first match (which is also
the end of the string) so the reduced regex ^$ matches indefinitely:
    [0] = "", [1] = ""

Since my loop is adding the matches for each iteration, it eventually
hits a bad_alloc exception.

So I guess this is by design for Boost Regex.  Obviously my loop can be
adjusted to bail if start_iterator == end_iterator.  But I wonder if
there's something else my loop should be accounting for?  I suppose
there's nothing that boost::regex_search can do for this situation?


Nick


_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
Loading...