[xpressive] use of Boyer-Moore

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[xpressive] use of Boyer-Moore

l_d_allan
Eric N. wrote during 2005
> do to make Boyer-Moore work with the regex traits interface make
> Boyer-Moore more trouble than its worth for case-insensitive matches. I
> haven't yet run the other tests.
 
Just curious ... does xpressive use "standard" Boyer-Moore or a variant? In my limited experience, the Horspool variant of Boyer-Moore is simpler and tends to be faster. It may be more or less prone to have problems with "pathological" cases than "standard" Boyer-Moore, however.

 

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [xpressive] use of Boyer-Moore

Eric Niebler
Lynn Allan wrote:
> Eric N. wrote during 2005
>  > do to make Boyer-Moore work with the regex traits interface make
>  > Boyer-Moore more trouble than its worth for case-insensitive matches. I
>  > haven't yet run the other tests.
>  
> Just curious ... does xpressive use "standard" Boyer-Moore or a variant?
> In my limited experience, the Horspool variant of Boyer-Moore is simpler
> and tends to be faster. It may be more or less prone to have problems
> with "pathological" cases than "standard" Boyer-Moore, however.


Xpressive uses a home-grown variant of Horspool. It's an extension to
the algorithm that allows it to perform case-insensitive matches. It
also works with large character sets, such as Unicode, which standard
Horspool does not, IIUC. It yields potential matches, not exact matches.
Only if it finds a tentative match is a full regex match attepted.

Since I wrote the above, I have fixed the performance problem with BMH
and case-insensitive matches by extended the regex traits class with a
function that returns all the case-folded equivalents of a character.
This resulted in a significant performance improvement for
case-insensitive matches.

HTH,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [xpressive] use of Boyer-Moore

Sebastian Redl
Eric Niebler wrote:

>Since I wrote the above, I have fixed the performance problem with BMH
>and case-insensitive matches by extended the regex traits class with a
>function that returns all the case-folded equivalents of a character.
>This resulted in a significant performance improvement for
>case-insensitive matches.
>  
>
How does that work with multiple character case mappings, like the
German ß -> SS (the sharp s does not exist in upper case)?

Sebastian Redl
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users
Reply | Threaded
Open this post in threaded view
|

Re: [xpressive] use of Boyer-Moore

Eric Niebler
Sebastian Redl wrote:

> Eric Niebler wrote:
>
>> Since I wrote the above, I have fixed the performance problem with BMH
>> and case-insensitive matches by extended the regex traits class with a
>> function that returns all the case-folded equivalents of a character.
>> This resulted in a significant performance improvement for
>> case-insensitive matches.
>>  
>>
> How does that work with multiple character case mappings, like the
> German ß -> SS (the sharp s does not exist in upper case)?

It doesn't. :-P Xpressive aims for "Basic Unicode Support," as defined
by Unicode TR18 (http://www.unicode.org/reports/tr18/):

     Some caseless matches may match one character against two:
     for example, U+00DF "ß" matches the two characters "SS".
     And case matching may vary by locale. However, because many
     implementations are not set up to handle this, at Level 1
     only simple case matches are necessary.

So correct handling of German ß -> SS is only necessary for "Extended
Unicode Support," which would be nice but is a more distant goal. Sadly
and AFAICT, TR1.Regex doesn't even make accommodation for Basic Unicode
Support, since it doesn't provide syntax for character set subtraction
and intersection.

In short, it's a problem, but there are bigger fish to fry. If you need
a regex engine that can handle this today, try ICU.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com
_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users