[Xpressive] regex-like boost libraries vs re2c [was: Some timings [was: Baffled about results of search]]

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[Xpressive] regex-like boost libraries vs re2c [was: Some timings [was: Baffled about results of search]]

l_d_allan
Stuart Dootson wrote:
> Lynn - it might be interesting to use re2c (http://re2c.org/) to
> generate regex machines and see what their performance is like -
> after
> all, they do clain 're2c generates warning free code that is equal
> to
> hand-written code in terms of size, speed and quality.' (I'm
> offering
> no opinion on the veracity of that statement....), so (allegedly)
> they'd give a reasonable hand-code analogue....


I took a look at it, and it looks like it could possibly be quite
useful if executable size was a concern .... but it is pretty obscure
(at least to me) .... the only example this dummy could make any sense
out of at all was to recognize a number .... and that was
missing/obscuring the important illustration of how to invoke and use
the generated scanner code.

Were you able to get it to work to do anything even close to
recognizing Days-of-the-Week (full and abbreviations)? Did you find
actual compilable examples with "main" invoking the generated scanners
with a test-string to crunch?

<alert comment="whining">

I really wish developers would provide COMPLETE "baby steps with
training wheels" examples to flatten the learning curve. Eric N. did a
great job of documenting and providing examples for xpressive ....
that could be a model for other projects (not just boost).

I've subscribed to their sourceforge list and will swallow my pride
and ask the newbie question:
"This dummy doesn't expect to be able to figure out your project in 30
minutes ... how do you get it to something basic like recognize:"?

((Sunday|Sun)|(Monday|Mon)....(Saturday|Sat))

</alert>


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] regex-like boost libraries vs re2c

l_d_allan
>> Stuart Dootson wrote:
>> Lynn - it might be interesting to use re2c (http://re2c.org/) to
>> generate regex machines and see what their performance is like -
>> after
>> all, they do clain 're2c generates warning free code that is equal
>> to
>> hand-written code in terms of size, speed and quality.' (I'm
>> offering
>> no opinion on the veracity of that statement....), so (allegedly)
>> they'd give a reasonable hand-code analogue....
>
> Lynn Allan wrote:
> I took a look at it, and it looks like it could possibly be quite
> useful if executable size was a concern .... but it is pretty
> obscure

re2c seems very fast, and once you get the hang of it, not that
difficult to use. For the contrived test of looking for DayOfWeek
within a 1kb buffer, here are some results ... ymmv (WinXp-Sp2
Microsoft ToolKit-2003 C++ Compiler with /O2 optimize flag)

regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)";

By-hand parser: Elapsed for 10000 loops Ms: 110.583
By-hand parser: Elapsed for 100000 loops Ms: 1107.81

re2c generator Elapsed for 10000 loops Ms: 69.4683
re2c generator Elapsed for 100000 loops Ms: 700.546

Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 410.492
Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4164.45

Boost::regex: Elapsed for 10000 loops Ms: 622.377
Boost::regex: Elapsed for 100000 loops Ms: 5965.24


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] regex-like boost libraries vs re2c

Eric Niebler

Lynn Allan wrote:

>>>Stuart Dootson wrote:
>>>Lynn - it might be interesting to use re2c (http://re2c.org/) to
>>>generate regex machines and see what their performance is like -
>>>after
>>>all, they do clain 're2c generates warning free code that is equal
>>>to
>>>hand-written code in terms of size, speed and quality.' (I'm
>>>offering
>>>no opinion on the veracity of that statement....), so (allegedly)
>>>they'd give a reasonable hand-code analogue....
>>
>>Lynn Allan wrote:
>>I took a look at it, and it looks like it could possibly be quite
>>useful if executable size was a concern .... but it is pretty
>>obscure
>
>
> re2c seems very fast, and once you get the hang of it, not that
> difficult to use. For the contrived test of looking for DayOfWeek
> within a 1kb buffer, here are some results ... ymmv (WinXp-Sp2
> Microsoft ToolKit-2003 C++ Compiler with /O2 optimize flag)
>
> regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)";
>
> By-hand parser: Elapsed for 10000 loops Ms: 110.583
> By-hand parser: Elapsed for 100000 loops Ms: 1107.81
>
> re2c generator Elapsed for 10000 loops Ms: 69.4683
> re2c generator Elapsed for 100000 loops Ms: 700.546
>
> Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 410.492
> Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4164.45
>
> Boost::regex: Elapsed for 10000 loops Ms: 622.377
> Boost::regex: Elapsed for 100000 loops Ms: 5965.24


Interesting! Can you confirm that re2c is not handling backreferences?
That is, after a match, is there a way to access what the Nth group
matched? Also, do you think you could send around the code that re2c is
generating for this expression?

I'm guessing that re2c is generating a DFA. Boost.Regex and Xpressive
generate NFAs because DFAs aren't suited to doing backreferences[*].
I've considered adding DFA support to Xpressive, and use DFAs for those
regexes that don't need the full power of NFAs. Clearly, the performance
win would be worth the trouble. This would not be a trivial undertaking,
however.

[*] Technically, DFAs only have a problem with patterns such as
"(.)\\1"; that is, when the result of the backreference is used within
the pattern itself.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] regex-like boost libraries vs re2c [was: Some timings [was: Baffled about results of search]]

Stuart Dootson
In reply to this post by l_d_allan
On 4/13/06, Lynn Allan <[hidden email]> wrote:

> Stuart Dootson wrote:
> > Lynn - it might be interesting to use re2c (http://re2c.org/) to
> > generate regex machines and see what their performance is like -
> > after
> > all, they do clain 're2c generates warning free code that is equal
> > to
> > hand-written code in terms of size, speed and quality.' (I'm
> > offering
> > no opinion on the veracity of that statement....), so (allegedly)
> > they'd give a reasonable hand-code analogue....
>
> Were you able to get it to work to do anything even close to
> recognizing Days-of-the-Week (full and abbreviations)? Did you find
> actual compilable examples with "main" invoking the generated scanners
> with a test-string to crunch?

Nope - never used it - just know of it. I've only needed to discard
Boost.Regex in a few (profiler indicated) cases, at which point I've
hand-coded a suitable routine.

Stuart Dootson
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] regex-like boost libraries vs re2c

l_d_allan
In reply to this post by Eric Niebler
> Interesting! Can you confirm that re2c is not handling
> backreferences?
> That is, after a match, is there a way to access what the Nth group
> matched? Also, do you think you could send around the code that re2c
> is generating for this expression?

This regex newbie is ignorant about what it means to "back-reference."
My usage (so far) has been recognizing one type of expression at a
time, like:
((Sunday|Sun)|(Monday|Mon) ...etc.(Saturday|Sat))
or
ZipCode ##### or #####-####

It seems to be able to get part way thru the longer possibility, and
then "settle for" the shorter "abbreviation". That probably isn't
back-referencing.

ZipCode 00000-000 is "almost" #####-####, but isn't, so it recognizes
#####.
MondX is "almost" Monday, but isn't, so it recognizes Mon, which is
also group=2.

It is relatively straightforward to get the results of a single match
and do what you want with it ... I don't have experience with
untangling something more complicated like a multi-piece date/time:
MMM DDD, YYYY hh:mm:ss [ap].m

> I'm guessing that re2c is generating a DFA. Boost.Regex and
> Xpressive
> generate NFAs because DFAs aren't suited to doing backreferences[*].
> I've considered adding DFA support to Xpressive, and use DFAs for
> those regexes that don't need the full power of NFAs. Clearly, the
> performance win would be worth the trouble. This would not be a
> trivial undertaking, however.
>
> [*] Technically, DFAs only have a problem with patterns such as
> "(.)\\1"; that is, when the result of the backreference is used
> within
> the pattern itself.

Re2c generates VERY gnarly code, full of goto's and labels ... but the
compiler is happy and the optimizer seems to straighten everything out
into fast object code. Re2c is apparently part of PHP.

yy4:
      yych = *++YYCURSOR;
      goto yy3;
yy5:
      yych = *++YYCURSOR;
      if(yych <= '/') goto yy6;
      if(yych <= '9') goto yy7;
yy6:
      YYCURSOR = YYMARKER;
      switch(yyaccept){
      case 1:  goto yy10;
      case 0:  goto yy3;
      }
yy7:
      yych = *++YYCURSOR;
      if(yych <= '/') goto yy6;
      if(yych >= ':') goto yy6;
      yych = *++YYCURSOR;
      if(yych <= '/') goto yy6;


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] regex-like boost libraries vs re2c

l_d_allan
In reply to this post by Eric Niebler
> regex = "(Sunday|Sun)|(Monday|Mon) etc.(Saturday|Sat)";
>
> By-hand parser: Elapsed for 10000 loops Ms: 110.583
> By-hand parser: Elapsed for 100000 loops Ms: 1107.81
>
> re2c generator Elapsed for 10000 loops Ms: 69.4683
> re2c generator Elapsed for 100000 loops Ms: 700.546
>
> Boost::xpressive-static-iterator: 10000 loops Ms: 410.492
> Boost::xpressive-static-iterator: 100000 loops Ms: 4164.45

Eric Niebler wrote:
> Interesting!

I did some HiResTimer comparisons to strstr and wonder if the results
are credible ... the re2c numbers are much closer to strstr than I
expected. (also, these reflect some "tweaking" since the previous
email that recognized DayOfWeek rather than ZipCode, and the numbers
below are about 40% faster than previous)

const char* pzStrToScan_Re2cSearch =
    "12345 at pos=0 and "
    "another zip-code 12345-6789 at pos=36 and "
    "another 98765-4321 at pos=69 and "
    "another at end=113 11223-3445";

const char* pzStrToScan_strstr =
    "12345 at pos=0 and "
    "another zip-code 12345-6789 at pos=36 and "
    "another 12345-4321 at pos=69 and "
    "another at end=113 12345-3445";

WinXp-Sp2 vc7.1 on AMD-3700

10,000 loops thru above to find 40,000 matches
strstr just looking for 12345:         3.2 milliseconds
Re2cSearch looking for [0-9]{5}(-[0-9]{4})? :  5.3 ms

100,000 loops thru above to find 400,000 matches
(mostly to verify that optimizer isn't distorting the results)
strstr just looking for 12345:        31.8 milliseconds
Re2cSearch looking for [0-9]{5}(-[0-9]{4})? :  53.8 ms

The re2c generated code also passes a relatively extensive
cppunit-like test.

> Also, do you think you could send around the code that re2c
> is generating for this expression?

Here is a link to a .zip with the vc6 and vc7.1 projects (vc8 to
follow):
http://cleanspeech.sf.net/misc/re2c_ZipCodeRe_060419.zip

Feedback appreciated, especially if I've messed up and the numbers are
flawed (not unlikely).

(Also, there is some "hold your nose" code in this C prototype. The
intent is to eventually be able to use the relatively simple ZipCodeRe
as a "getting up to speed" example for re2c newbies (like myself), and
perhaps as a template/clone for other 'recognizers'.)



_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users