regex_search? Which subexpression smatch'ed?

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

regex_search? Which subexpression smatch'ed?

l_d_allan
<alert comment="boost and regex newbie">

I'm using regex_search to detect which subexpression matched. I adapted the "Captures" example at:
http://www.boost.org/libs/regex/doc/captures.html

In the snippet below, it works ok to loop thru m[num].matched, but I was wondering if there was a more direct way of getting the information about which match was "hit". (The real application is going to have about 70 possible subexpressions to match against, instead of just four.)
 
std::string contents("String has two, three, one, and four in it. ");
boost::regex reg("(one)|(two)|(three)|(four)");
 
From the above, I want the output to be:
<match=2>two
<match=3>three
<match=1>one
<match=4>four

void test_search()
{
  boost::smatch m;
  std::string contents("String has two, three, one, and four in it. ");
  boost::regex reg("(one)|(two)|(three)|(four)");
  std::string::const_iterator it = contents.begin();
  std::string::const_iterator end = contents.end();
  while (boost::regex_search(it, end, m, reg)) {
    for (int num = 1; num <= 4; ++num) {
      if (m[num].matched == true) {
        std::cout << "<match=" << num << ">" << m[0] << std::endl;
        break;
      }
    }
    it = m[0].second;
  }
}

</alert>

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

John Maddock
> In the snippet below, it works ok to loop thru m[num].matched, but I
> was wondering if there was a more direct way of getting the
> information about which match was "hit". (The real application is
> going to have about 70 possible subexpressions to match against,
> instead of just four.)

70!!!  It wasn't really designed that kind of useage, and there isn't a
simple way to get say the lowest index sub-expression that was matched: even
with a full knowledge of the implementation it's not clear to me that it's
possible to do better than an O(N) algorithm.

You could I suppose group the sub-expressions together so you could use
something more like a binary search to find the one that matched:

((a|b|c)|(d|e|f))

Then

if(my_match[1].matched)
{
// must be a, b or c
}
else
{
// must be d e or f.
}

Sorry I can't be more helpful, but it's tempting to suggest that you use the
spirit parser libary rather than regexes.

John.

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

l_d_allan

"John Maddock" <[hidden email]> wrote in message
news:010801c6595d$74466980$901c0952@fuji...

>> In the snippet below, it works ok to loop thru m[num].matched, but I
>> was wondering if there was a more direct way of getting the
>> information about which match was "hit". (The real application is
>> going to have about 70 possible subexpressions to match against,
>> instead of just four.)
>
> 70!!!  It wasn't really designed that kind of useage, and there isn't a
> simple way to get say the lowest index sub-expression that was matched:
> even
> with a full knowledge of the implementation it's not clear to me that it's
> possible to do better than an O(N) algorithm.

Thanks for the reply.

It's actually a lot more than 70 total sub-expressions, but 66 "groups" ....
the application is recognizing Bible verse references and emitting a code.

Genesis 1:1 becomes <xb=1.1.1>Genesis 1:1</xb> and
Gen 1 becomes <xb=1.1.1>Gen 1</xb>
The chapter number needs to be specified, but the verse number is optional
(after the colon).

Here is the actual MOAR (mother of all regexes):

((Gen|Ge|Genesis)|
(Exo|Ex|Exodus|Exod)|
(Le|Leviticus|Lev)|
(Nu|Num|Numbers)|
(De|Deuteronomy|Dt|Deut)|
(Jos|Josh|Joshua)|
(Jdg|Judges|Judg)|
(Ru|Ruth)|
(1 Sa|1Sa|1 Samuel|1 Sam|1Sam|1 Kingdoms|I Samuel|First Samuel|1st Samuel)|
(2 Sa|2Sa|2 Samuel|2 Sam|2Sam|2 Kingdoms|II Samuel|Second Samuel|2nd
Samuel)|
(1 Ki|1Ki|1 Kings|1 Kgs|1 Kin|1 Ki|1Kings|3 Kingdoms|I Kings|First Kings|1st
Kings)|
(2 Ki|2Ki|2 Kings|2 Kgs|2 Kin|2 Kgs|II Kings|IIKings|2Kings|Second Kings|2nd
Kings|4 Kingdoms)|
(1 Ch|1Ch|1 Chronicles|1 Chr|1 Chron|1Chron|I Chronicles|IChronicles|First
Chronicles|1st Chronicles)|
(2 Ch|2Ch|2 Chronicles|2 Chr|2 Chron|2Chron|II
Chronicles|IIChronicles|Second Chronicles|2nd Chronicles)|
(Ezr|Ezra)|
(Ne|Neh|Nehemiah)|
(Es|Esther|Esth)|
(Jb|Job)|
(Ps|Psalm|Psa|Psalms)|
(Pr|Proverbs|Pro|Prov)|
(Ec|Ecc|Ecclesiastes|Eccl|Eccles|Ecclesiastics)|
(Cant|Song|Song of Songs|Song of Solomon|Song of Sol)|
(Is|Isaiah|Isa)|
(Je|Jeremiah|Jer)|
(La|Lamentations|Lam)|
(Ez|Ezekiel|Ezk|Ezek)|
(Da|Daniel|Dan)|
(Ho|Hosea|Hs|Hos)|
(Jl|Joel)|
(Am|Amos)|
(Ob|Obadiah|Obad)|
(Jnh|Jonah)|
(Mi|Micah|Mic)|
(Na|Nahum|Nah)|
(Hab|Habakkuk)|
(Zep|Zephaniah|Zeph)|
(Hag|Haggai)|
(Zech|Zec|Zechariah)|
(Mal|Malachi)|
(Mat|Matthew|Matt|Mt)|
(Mar|Mark|Mk)|
(Lk|Luke|Luk)|
(Jn|John|Joh)|
(Ac|Acts|Act)|
(Rm|Romans|Rom|Roman)|
(1Cor|1 Co|1Co|1 Corinthians|1 Corin|1Corin|1 Cor|I Corinthians|I.
Corinthians|First Corinthians|1st Corinthians)|
(2Cor|2 Co|2Co|2 Corinthians|2 Corin|2Corin|2 Cor|II Corinthians|II.
Corinthians|Second Corinthians|2nd Corinthians)|
(Ga|Galatians|Gal)|
(Ep|Ephesians|Eph)|
(Phil|Philippians|Php|Philipians|Phillipians)|
(Col|Colossians)|
(1 Th|1Th|1 Thessalonians|1Thessalonians|1 Thess|1 Thes|1Thess|1Thes|I
Thessalonians|First Thessalonians|1st Thessalonians)|
(2 Th|2Th|2 Thessalonians|2Thessalonians|2 Thess|2 Thes|2Thess|2Thes|II
Thessalonians|Second Thessalonians|2nd Thessalonians)|
(1 Ti|1Ti|1 Timothy|1Timothy|1Tim|1 Tim|I Timothy|First Timothy|1st
Timothy)|
(2 Ti|2Ti|2 Timothy|2Timothy|2Tim|2 Tim|II Timothy|Second Timothy|2nd
Timothy)|
(Ti|Titus|Tit)|
(PHILEMON|Phile|Philemon|Phlm|Philm)|
(Hb|Hebrews|Heb|Hebrew)|
(Ja|James|Jas|Jms)|
(1 Pe|1Pe|1 Peter|1 Pet|I Peter|IPeter|First Peter|1st Peter)|
(2 Pe|2Pe|2 Peter|2 Pet|II Peter|IIPeter|Second Peter|2nd Peter)|
(1 Jn|1Jn|1 John|1 Joh|1Joh|1John|I John|IJohn|First John|1st John)|
(2 Jn|2Jn|2 John|2 Joh|2Joh|2John|II John|IIJohn|Second John|2nd John)|
(3 Jn|3Jn|3 John|3 Joh|3Joh|3John|III John|IIIJohn|Third John|3rd John)|
(Jde|Jude)|
(Apoc|Revelation|Rev|Revel|Apocrypha|Revelation of John|Rev of John|Rev. of
John))
\.? ?([0-9]{1,3})(:([0-9]{1,3}))?

> You could I suppose group the sub-expressions together so you could use
> something more like a binary search to find the one that matched:
>
> ((a|b|c)|(d|e|f))
>
> Then
>
> if(my_match[1].matched)
> {
> // must be a, b or c
> }
> else
> {
> // must be d e or f.
> }

My impression is that I'm already doing something like you suggest. The code
doesn't need to know which of the three variants of Genesis was matched,
just that it was from the Genesis "group". The adaption of the "Captures"
code works well. Thanks for providing it.

Obviously easier said than done, but if the library can turn the appropriate
m[index].matched to "true", then perhaps there could be another variable in
the class to reflect the lowest index of the first group that is true.  I
realize this gets more complicated because there can be multiple groups.

> Sorry I can't be more helpful, but it's tempting to suggest that you use
> the
> spirit parser libary rather than regexes.

I'll take a look at it .... I'm really impressed with your boost::regex
library, and am inclined to continue using it. Thanks for making it
available.


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

John Maddock
> Obviously easier said than done, but if the library can turn the
> appropriate m[index].matched to "true", then perhaps there could be
> another variable in the class to reflect the lowest index of the
> first group that is true.  I realize this gets more complicated
> because there can be multiple groups.

It's backtracking that gets you: each time the "lowest sub-expression set"
gets updated you'd need to store a complete history of previous settings so
you can backtrack out again.  However, now I think about that some more, I
actually think it might be possible after all :-)

John.

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

l_d_allan
Whatever you could work out would be greatly appreciated.

----- Original Message -----
From: "John Maddock" <[hidden email]>
> It's backtracking that gets you: each time the "lowest sub-expression set"
> gets updated you'd need to store a complete history of previous settings
> so
> you can backtrack out again.  However, now I think about that some more, I
> actually think it might be possible after all :-)


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

[Regex] Simpler load_file code to use in snippet examples?

l_d_allan
In reply to this post by l_d_allan
Thanks for providing and maintaining regex.

The following code is used in a number of the example snippets:

#void load_file(std::string& s, std::istream& is)
#{
#   s.erase();
#   if(is.bad()) return;
#   s.reserve(is.rdbuf()->in_avail());
#   char c;
#   while(is.get(c))
#   {
#      if(s.capacity() == s.size())
#         s.reserve(s.capacity() * 3);
#      s.append(1, c);
#   }
#}

My impression is that this code is used because some compilers won't accept
a simpler and more direct approach.

I was wondering if the following would work "cross-platform" (It seems to
work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant
of other compilers):

#void load_file(std::string& s, std::istream& is)
#{
#   if(is.bad()) return;
#   // Use 0 as CrLf delimiter to cause entire file to be read
#   getline(is, s, '\0');
#}



_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Regex] Simpler load_file code to use in snippet examples?

Rupert Swarbrick
Lynn Allan wrote:

> I was wondering if the following would work "cross-platform" (It seems to
> work ok with the Microsoft Visual Studio vc7.1 compiler, but I am ignorant
> of other compilers):
>
> #void load_file(std::string& s, std::istream& is)
> #{
> #   if(is.bad()) return;
> #   // Use 0 as CrLf delimiter to cause entire file to be read
> #   getline(is, s, '\0');
> #}
Well, there's an immediate problem if the file contains any NULL bytes,
which most non-text files do - the code will instantly stop on them.
There may be a faster way using a getline() loop, based on the fact that
many bytes are not \0, but I'm not sure that you'd actually lose much
overhead, given that the stream is likely to be buffered?

I think that the code given is probably the simplest that is actually
guaranteed to work.

Please tell me if I'm being an idiot of course!

Rupert

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Regex] Simpler load_file code to use in snippetexamples?

l_d_allan
> Lynn Allan wrote:
>> #void load_file(std::string& s, std::istream& is)
>> #{
>> #   if(is.bad()) return;
>> #   // Use 0 as CrLf delimiter to cause entire file to be read
>> #   getline(is, s, '\0');
>> #}

> Well, there's an immediate problem if the file contains any NULL bytes,
> which most non-text files do - the code will instantly stop on them.
> There may be a faster way using a getline() loop, based on the fact that
> many bytes are not \0, but I'm not sure that you'd actually lose much
> overhead, given that the stream is likely to be buffered?

Granted, the proposed code only works with non-binary data that doesn't have
embedded NULL bytes. That seems a reasonable assumption for the snippets
provided, as well as most situations where boost::regex would be used.

And o/s buffering/caching may make it a moot point.


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [Regex] Simpler load_file code to use in snippetexamples?

John Maddock
In reply to this post by l_d_allan
> My impression is that this code is used because some compilers won't
> accept a simpler and more direct approach.
>
> I was wondering if the following would work "cross-platform" (It
> seems to work ok with the Microsoft Visual Studio vc7.1 compiler, but
> I am ignorant of other compilers):
>
> #void load_file(std::string& s, std::istream& is)
> #{
> #   if(is.bad()) return;
> #   // Use 0 as CrLf delimiter to cause entire file to be read
> #   getline(is, s, '\0');
> #}

Looks like a good idea.

However one of the main reasons for code being as it is, are rather
sub-optimal getline implementations that append one character at a time to
the string, without reserving any space first: this results in an O(N^2)
algorithm for many string implementations.

Previously I used istream iterators and std::copy to copy the stream's
contents to the string.  It's lovely simple code, and good use of the std
lib, but it takes an age to run :-(

Basically there are really simple solutions that work well for small files,
and then crap out big time as soon as the file size grows.  Of course you
could always replace std::string with std::vector and get O(N log N)
performance, but then you're still open to "why did you do that" questions.

John.

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

Eric Niebler
In reply to this post by l_d_allan
Lynn Allan wrote:
> Whatever you could work out would be greatly appreciated.
>
> ----- Original Message -----
> From: "John Maddock" <[hidden email]>
>> It's backtracking that gets you: each time the "lowest sub-expression set"
>> gets updated you'd need to store a complete history of previous settings
>> so
>> you can backtrack out again.  However, now I think about that some more, I
>> actually think it might be possible after all :-)


Forgive me for jumping in late. This seems like a pretty silly
optimization to me. Matching a regex, especially a complicated one like
this, is going to take soooo much longer than a simple O(N) traversal of
the resulting sub-matches that it just doesn't make sense to track this
information during match time. In fact, it'll probably run *slower*.
Don't do during match time what you can do before or after. You'll turn
a simple O(N) operation into something far worse, and you'll make
everybody pay for it.

If you don't want to muddy up user code with a bunch of
"if(results[n].matched)" checks, just write an iterator adaptor (using
boost::filter_iterator<>) that iterates over only those sub-matches
which matched, and also provides the index.

My $.02,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

John Maddock
> Forgive me for jumping in late. This seems like a pretty silly
> optimization to me. Matching a regex, especially a complicated one
> like this, is going to take soooo much longer than a simple O(N)
> traversal of the resulting sub-matches that it just doesn't make
> sense to track this information during match time. In fact, it'll
> probably run *slower*. Don't do during match time what you can do
> before or after. You'll turn a simple O(N) operation into something
> far worse, and you'll make everybody pay for it.
>
> If you don't want to muddy up user code with a bunch of
> "if(results[n].matched)" checks, just write an iterator adaptor (using
> boost::filter_iterator<>) that iterates over only those sub-matches
> which matched, and also provides the index.
>
> My $.02,

Good point: even enumerating 100 sub-expressions to find out which was
matched is going to be pretty quick compared to the alternatives.  The
balance only really changes if you have thousands of marked sub-expressions,
and I really hope we don't see regexes like that any time soon :-)

John.

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

l_d_allan
Just seemed that at the point when the code sets mySmatch[curIndex].matched
to true, then it could "tuck away" that index in an accessible variable ....
but apparently not.

Any suggestions about alternative on how to best accomplish the search for
Bible verse references? I'd have to think the participants in this thread
would have some deep insights on the best way to proceed.

I've written different "flavors" of state-machines that recognize these kind
of patterns, but a regex seems like the way to go unless performance has to
be emphasized.

Thanks.

----- Original Message -----
From: "John Maddock" <[hidden email]>
Newsgroups: gmane.comp.lib.boost.user
Sent: Friday, April 07, 2006 11:56 AM
Subject: Re: regex_search? Which subexpression smatch'ed?


>> Forgive me for jumping in late. This seems like a pretty silly
>> optimization to me. Matching a regex, especially a complicated one
>> like this, is going to take soooo much longer than a simple O(N)
>> traversal of the resulting sub-matches that it just doesn't make
>> sense to track this information during match time. In fact, it'll
>> probably run *slower*. Don't do during match time what you can do
>> before or after. You'll turn a simple O(N) operation into something
>> far worse, and you'll make everybody pay for it.
>>
>> If you don't want to muddy up user code with a bunch of
>> "if(results[n].matched)" checks, just write an iterator adaptor (using
>> boost::filter_iterator<>) that iterates over only those sub-matches
>> which matched, and also provides the index.
>>
>> My $.02,
>
> Good point: even enumerating 100 sub-expressions to find out which was
> matched is going to be pretty quick compared to the alternatives.  The
> balance only really changes if you have thousands of marked
> sub-expressions,
> and I really hope we don't see regexes like that any time soon :-)
>
> John.


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: regex_search? Which subexpression smatch'ed?

John Maddock
> Just seemed that at the point when the code sets
> mySmatch[curIndex].matched to true, then it could "tuck away" that
> index in an accessible variable .... but apparently not.

No it has to maintain a stack of all past values, in order to backtrack out
of a match.

> Any suggestions about alternative on how to best accomplish the
> search for Bible verse references? I'd have to think the participants
> in this thread would have some deep insights on the best way to
> proceed.

This is a deep insight free zone :-)

If you just want the first index to have matched then simply a linear search
is as good as anything:

int lowest_match = 1;
for(; lowest_match < mymatch.size(); ++lowest_match)
    if(mymatch[lowest_match].matched) break;

which assumes that your regex is along the lines of:

(a)|(b)|(c)|...

If that turns out too slow, then you might then have to collect together
common prefixes for some of the expressions into single alternatives, but
I'm not knowledgeable on bible references.

John.


_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users