[Xpressive] Baffled about results of search

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[Xpressive] Baffled about results of search

l_d_allan
<alert comment="xpressive newbie">

vc7.1
xpressive from "vault" as of today (April 11)

With the statements below (more or less based on Example 1 and Example
2 in the sample code provided with xpressive), I'm attempting to
detect which "group" of subexpressions matched, and getting results
that I don't understand (not unexpected for a newbie <g>).

std::string altDow( "Alternate days of the week are Tue and Thursday
and Sat and Monday." );

sregex rex = sregex::compile( "((Sunday|Sun)"
      "|(Monday|Mon)"
      "|(Tuesday|Tue)"
      "|(Wednesday|Wed)"
      "|(Thursday|Thu)"
      "|(Friday|Fri)"
      "|(Saturday|Sat))" );

The first regex_search sort of works, but I'm unclear what field in
the smatch:what variable to use to tell what matched, and some
oddities are happening.
regex_search( altDow, what, rex )

Length(0): 3
Position(0): 31
what[0]: Tue
what[1]: Tue
what[2]:
what[3]:
what[4]: Tue

The subsequent searches give valid information about the length and
position, but other fields in the smatch:what variable seem invalid.
regex_search( altDow.substr(32), what, rex)
regex_search( altDow.substr(45), what, rex)

Regardless, I'm unclear how to tell which of the "groups" of
subexpressions was "hit". With boost::regex there was a way to loop
thru the boolean field:
what[i].matched
and see what "hit", but I am not seeing a comparable accessor.

Am I doing something wrong, leaving something out, completely clueless
on using xpressive, or is xpressive not intended for this kind of
usage?

(and if xpressive is appropriate, what statements will provide the
intended information?)

The intention is to be able to to replace/encode the altDay string to
be:
Alternate days of the week are <dow=2>Tue</dow>
 and <dow=4>Thursday</dow>
 and <dow=6>Sat</dow>
 and <dow=1>Monday</dow>." );

If it will help, here is a link to the vc7.1 project.
http://cleanspeech.sf.net/misc/DayOfWeekTest.zip

</alert>


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Baffled about results of search

Eric Niebler
Lynn Allan wrote:
>
> The subsequent searches give valid information about the length and
> position, but other fields in the smatch:what variable seem invalid.
> regex_search( altDow.substr(32), what, rex)
> regex_search( altDow.substr(45), what, rex)

Watch out! altDow.substr(32) is returning a temporary string object.
After regex_search returns, the temporary is destroyed and the results
object is left holding iterators into a destroyed string.


> Regardless, I'm unclear how to tell which of the "groups" of
> subexpressions was "hit". With boost::regex there was a way to loop
> thru the boolean field:
> what[i].matched
> and see what "hit", but I am not seeing a comparable accessor.

Xpressive is no different that Boost.Regex in this regard.
what[i].matched is correct.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Baffled about results of search

l_d_allan
Eric Niebler wrote:

> Lynn Allan wrote:
>>
>> The subsequent searches give valid information about the length and
>> position, but other fields in the smatch:what variable seem
>> invalid.
>> regex_search( altDow.substr(32), what, rex)
>> regex_search( altDow.substr(45), what, rex)
>
> Watch out! altDow.substr(32) is returning a temporary string object.
> After regex_search returns, the temporary is destroyed and the
> results
> object is left holding iterators into a destroyed string.

Arggggg. Trying to get "up to speed" with Boost is probably a flawed
thing to try to do at the same time as learning rudimentary aspects of
STL. "We're not in MFC-land any more, Toto." <g>

std::string altDow32 = altDow.substr(32);
if( regex_search( altDow32, what, rex ) )  { unpack }

The above (ugly) code clears up the "self inflicted wound". I'm
"holding my nose" and using it to "unpack" the contents of the
match_results variables. Thanks for the clarification.

>> Regardless, I'm unclear how to tell which of the "groups" of
>> subexpressions was "hit". With boost::regex there was a way to loop
>> thru the boolean field:
>> what[i].matched
>> and see what "hit", but I am not seeing a comparable accessor.
>
> Xpressive is no different that Boost.Regex in this regard.
> what[i].matched is correct.

Works well. I think an earlier "trial and error stab at it" used
what[i].matched() and back-tracked too far when the compiler
complained. Your documentation is superior, but I've got a long ways
to go to understand template notation adequately.

Thanks again for your patience ....



_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Baffled about results of search

Eric Niebler

Lynn Allan wrote:

> Eric Niebler wrote:
>
>>Lynn Allan wrote:
>>
>>>The subsequent searches give valid information about the length and
>>>position, but other fields in the smatch:what variable seem
>>>invalid.
>>>regex_search( altDow.substr(32), what, rex)
>>>regex_search( altDow.substr(45), what, rex)
>>
>>Watch out! altDow.substr(32) is returning a temporary string object.
>>After regex_search returns, the temporary is destroyed and the
>>results
>>object is left holding iterators into a destroyed string.
>
>
> Arggggg. Trying to get "up to speed" with Boost is probably a flawed
> thing to try to do at the same time as learning rudimentary aspects of
> STL. "We're not in MFC-land any more, Toto." <g>
>
> std::string altDow32 = altDow.substr(32);
> if( regex_search( altDow32, what, rex ) )  { unpack }


A better approach would be to use iterators.

// Begin searching 32 characters into altDow
regex_search( altDow.begin() + 32, altDow.end(), what, rex )

Note #1: the match results will now contain iterators into altDow, but
position offsets will be relative to the start of the search, not the
start of altDow.

Note #2: this code is only valid if altDow has at least 32 characters.
Error checking is left as an exercise. :-)

Incidentally, your earlier mistake was distressingly easy to make. I can
think of a way I might be able to catch this mistake at compile time in
a future versions of xpressive.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Baffled about results of search

l_d_allan
> A better approach would be to use iterators.
>
> // Begin searching 32 characters into altDow
> regex_search( altDow.begin() + 32, altDow.end(), what, rex )
>
> Note #1: the match results will now contain iterators into altDow,
> but
> position offsets will be relative to the start of the search, not
> the
> start of altDow.

Is the following getting closer to "best practice" to use
xpressive-static for the kinds of specialized searches I'm attempting.
When I'm further along on the learning curve, I'll try to prepare some
"apples and apples" timing comparisons between boost::regex,
xpressive, spirit, a hand-tuned state-machine searcher, and an
automatic FSM generator utility.

I think I'm conforming to the guidelines on this page:
X:\DevTools\Boost\libs\xpressive\doc\html\boost_xpressive\user_s_guide\tips_n_tricks.html
except I don't understand how to specify
syntax_option_type::optimize
(does it apply to xpressive-static, or only xpressive-dynamic)?

boost::xpressive::regex_constants::syntax_option_type flags =
                     boost::xpressive::regex_constants::optimize;
sregex rexStatic =
    (s1 = (as_xpr("Sunday") | "Sun")) |
    (s2 = (as_xpr("Monday") | "Mon")) |
    (s3 = (as_xpr("Tuesday") | "Tue")) |
    (s4 = (as_xpr("Wednesday") | "Wed")) |
    (s5 = (as_xpr("Thursday") | "Thu")) |
    (s6 = (as_xpr("Friday") | "Fri")) |
    (s7 = (as_xpr("Saturday") | "Sat")) ;

//??? rexStatic.compile(rexStatic, flags);
smatch what;

std::string testStr =
    "Alternate days of the week are Tue and Thursday and Sat and
Monday. "
    "And then Monday and Wed and Friday and Sun. ";
std::string::const_iterator start = testStr.begin();
std::string::const_iterator finish = testStr.end();
int foundCount = 0;
while (regex_search(start, finish, what, rexStatic)) {
    foundCount++;
    size_t limit = what.size();
    size_t matchIndex = 1;
    while (matchIndex <= limit) {
        if (what[matchIndex].matched == true) {
            break;
        }
        ++matchIndex;
    }
#ifdef _DEBUG
    cout << "FoundCount: " << foundCount
            << " Index: " << static_cast<int>(matchIndex)
            << " what[0]: " << what[0] << endl;
#endif
    start = what[0].second;
}


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Baffled about results of search

Eric Niebler

Lynn Allan wrote:

> Is the following getting closer to "best practice" to use
> xpressive-static for the kinds of specialized searches I'm attempting.
> When I'm further along on the learning curve, I'll try to prepare some
> "apples and apples" timing comparisons between boost::regex,
> xpressive, spirit, a hand-tuned state-machine searcher, and an
> automatic FSM generator utility.
>
> I think I'm conforming to the guidelines on this page:
> X:\DevTools\Boost\libs\xpressive\doc\html\boost_xpressive\user_s_guide\tips_n_tricks.html
> except I don't understand how to specify
> syntax_option_type::optimize
> (does it apply to xpressive-static, or only xpressive-dynamic)?


Static regexes assume the optimize flag.


> boost::xpressive::regex_constants::syntax_option_type flags =
>                      boost::xpressive::regex_constants::optimize;
> sregex rexStatic =
>     (s1 = (as_xpr("Sunday") | "Sun")) |
>     (s2 = (as_xpr("Monday") | "Mon")) |
>     (s3 = (as_xpr("Tuesday") | "Tue")) |
>     (s4 = (as_xpr("Wednesday") | "Wed")) |
>     (s5 = (as_xpr("Thursday") | "Thu")) |
>     (s6 = (as_xpr("Friday") | "Fri")) |
>     (s7 = (as_xpr("Saturday") | "Sat")) ;


Yep, you got it.


> //??? rexStatic.compile(rexStatic, flags);
> smatch what;
>
> std::string testStr =
>     "Alternate days of the week are Tue and Thursday and Sat and
> Monday. "
>     "And then Monday and Wed and Friday and Sun. ";
> std::string::const_iterator start = testStr.begin();
> std::string::const_iterator finish = testStr.end();
> int foundCount = 0;
> while (regex_search(start, finish, what, rexStatic)) {
>     foundCount++;
>     size_t limit = what.size();
>     size_t matchIndex = 1;
>     while (matchIndex <= limit) {
>         if (what[matchIndex].matched == true) {
>             break;
>         }
>         ++matchIndex;
>     }
> #ifdef _DEBUG
>     cout << "FoundCount: " << foundCount
>             << " Index: " << static_cast<int>(matchIndex)
>             << " what[0]: " << what[0] << endl;
> #endif
>     start = what[0].second;
> }


Ah, you're trying to find all the matches. Check out regex_iterator.

sregex_iterator begin(testStr.begin(), testStr.end(), rexStatic), end;
for(; begin != end; ++begin, ++foundCount) {
     smatch const &what = *begin;
     size_t limit = what.size();
     size_t matchIndex = 1;
     for(; matchIndex <= limit; ++matchIndex)
         if(what[matchIndex].matched == true)
             break;
#ifdef _DEBUG
     cout << "FoundCount: " << foundCount
          << " Index: " << static_cast<int>(matchIndex)
          << " what[0]: " << what[0] << endl;
#endif
}

HTH,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Some timings [was: Baffled about results of search]

l_d_allan
FWIW, here are some timings for different "flavors" of xpressive, and
some other boost and non-boost search algo's. YMMW.

Caveats:
* contrived data (BMH probably doesn't help)
   std::string testStr =
       "Alternate days of the week are Tue and Thursday and Sat and
Monday. "
       "And then Monday and Wed and Friday and Sun. "
       "Abbrev days in alpha order are Fri Mon Sat Sun Thu Tue Wed "
       "Full days in alpha order are Friday Monday Saturday Sunday
Thursday Tuesday Wednesday "
       "Abbrev days in reverse alpha order are Wed Tue Thu Sun Sat Mon
Fri "
       "Full days in reverse alpha order are Wednesday Tuesday
Thursday Sunday Saturday Monday Friday "
       "Embedded abbrev days in reverse alpha order are 1Wed 2Tue 3Thu
4Sun 5Sat 6Mon 7Fri "
       "Embedded Full days in alpha order are 1Friday 2Monday
3Saturday 4Sunday 5Thursday 6Tuesday 7Wednesday "
       "Abbrev dot days in alpha order are Fri. Mon. Sat. Sun. Thu.
Tue. Wed. "
       "Abbrev dot days in reverse alpha order are Wed. Tue. Thu. Sun.
Sat. Mon. Fri. "
       "Near misses are Wed.X Tue.X Thu.X Sun.X Sat.X Mon.X Fri.X "
       "Near misses are WednesDay TuesDay ThursDay SunDay SaturDay
MonDay FriDay "
       "Near misses are WeD TuE ThU SuN SaT MoN FrI "
    ;
* not real confident of the numbers .... seem reasonable but ...

Boost::xpressive-static: Elapsed for 10000 loops Ms: 422.72
Boost::xpressive-static: Elapsed for 100000 loops Ms: 4145.9
Boost::xpressive-dynamic-not-optimized: Elapsed for 10000 loops Ms:
429.847
Boost::xpressive-dynamic-not-optimized: Elapsed for 100000 loops Ms:
4392.54
Boost::xpressive-dynamic-optimized: Elapsed for 10000 loops Ms:
435.617
Boost::xpressive-dynamic-optimized: Elapsed for 100000 loops Ms:
4282.6
Boost::xpressive-static-iterator: Elapsed for 10000 loops Ms: 448.425
Boost::xpressive-static-iterator: Elapsed for 100000 loops Ms: 4379.07

Boost::regex: Elapsed for 10000 loops Ms: 622.377
Boost::regex: Elapsed for 100000 loops Ms: 5965.24
(can probably be tweaked and improved .... just q/d from example)

FSM generator Elapsed for 10000 loops Ms: 2487.97
FSM generator Elapsed for 100000 loops Ms: 24870.5

Hand-tuned parser: Elapsed for 10000 loops Ms: 110.583
Hand-tuned parser: Elapsed for 100000 loops Ms: 1107.81

(no numbers for spirit .... barely started on its learning curve)


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Some timings [was: Baffled aboutresults of search]

John Maddock
> * not real confident of the numbers .... seem reasonable but ...

That's roughly what I'd expect: Regex and xpressive use very similar
algorithms internally, with Eric and I occationally playing a game of
leapfrog :-)  And probably we've both already grabbed the low hanging fruit,
unless Eric has any cunning plans he's keeping to himself :-)  Performance
between the two depends upon the exact expression used, and the data
searched, but they probably should be within a factor of 2 of each other.
Historically Regex tended to do better with sparse matches in big files, and
xpressive better with short dense matches, but I haven't tested a really
recent version of xpressive, so as ever your mileage may vary.

John.

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Some timings [was: Baffled about results of search]

Stuart Dootson
In reply to this post by l_d_allan
On 4/13/06, Lynn Allan <[hidden email]> wrote:
> FWIW, here are some timings for different "flavors" of xpressive, and
> some other boost and non-boost search algo's. YMMW.
>

<snip>

Lynn - it might be interesting to use re2c (http://re2c.org/) to
generate regex machines and see what their performance is like - after
all, they do clain 're2c generates warning free code that is equal to
hand-written code in terms of size, speed and quality.' (I'm offering
no opinion on the veracity of that statement....), so (allegedly)
they'd give a reasonable hand-code analogue....

Stuart Dootson
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Some timings [was: Baffled about results of search]

Eric Niebler
Stuart Dootson wrote:

> On 4/13/06, Lynn Allan <[hidden email]> wrote:
>> FWIW, here are some timings for different "flavors" of xpressive, and
>> some other boost and non-boost search algo's. YMMW.
>>
>
> Lynn - it might be interesting to use re2c (http://re2c.org/) to
> generate regex machines and see what their performance is like - after
> all, they do clain 're2c generates warning free code that is equal to
> hand-written code in terms of size, speed and quality.' (I'm offering
> no opinion on the veracity of that statement....), so (allegedly)
> they'd give a reasonable hand-code analogue....

I don't think re2c does backreferences, so it might not be usable for
Lynn's purposes. It also means any comparison would be apples to
oranges, since regex matching with backreferences is a much harder problem.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Xpressive] Some timings [was: Baffled aboutresults of search]

l_d_allan
> I don't think re2c does backreferences, so it might not be usable
> for
> Lynn's purposes. It also means any comparison would be apples to
> oranges, since regex matching with backreferences is a much harder
> problem.

Some of this is simply curiousity on my part. (... and too much time
on my hands?)

Also, I've been quite impressed by the performance of xpressive and
regex (haven't looked much at spirit) ... my initial expectation was
that something general purpose and powerful would be worse than an
order of magnitude slower than "rolling your own" hand-tuned
recognizer.

I think it might be valuable for a person in the position of
evaluating whether to put in the time to learn boost::regex and/or
boost::xpressive could find information on the performance trade-offs
involved:
"So, how slow is it?"
(and then, "How bloated is it?" "How hard is it to learn?" "How hard
is it to build and integrate?")

Epecially that it probably won't be nearly as negative as a first
guess might be. (but of course, "it depends" applies to everything ).

BTW, my first impression is that re2c might be useful for some of the
things I work on .... it could get the code to the "area of interest",
where more elaborate replacements could be done. I don't think I
understand "backreferences" ... my mind hasn't absorbed past DFA's and
"greedy" always seems to be a liability.

Another BTW: my impression is that pcre doesn't do substitutions ....
based on an email from its author, Dr. Philip Hazel. Did I
misunderstand? I didn't see anything in its api to indicate
substitutions were possible.



_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users