Re: [regex] Working with wchar_t on older UNIX platforms

Re: [regex] Working with wchar_t on older UNIX platforms

Andrei Tarassov
John,

Thanks for the help! I've managed to create my own traits classes and even
got the whole thing to compile, but I found that it does not work :-)

Right now I am doing intermediate encoding/decoding between wchar_t and
the local encoding (which is determined by the locale). However, I do
not like that approach much.

I am intrigued by what you said about converting data from UTF-8 to
UTF-32 on the fly. Converting my Unicode strings to UTF-8 encoded strings
is not a problem at all. Where could I read about those on-the-fly
conversions, and what limitations do they have (e.g. how are locale
settings handled)?

Thanks,
Andrei

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of John Maddock
Sent: Tuesday, March 21, 2006 12:36
To: [hidden email]
Subject: Re: [Boost-users] [regex] Working with wchar_t on older
UNIX platforms

> Now I tried to integrate wregex in the software, but it just would
> not compile complaining about missing wstring (and defined
> BOOST_NO_WREGEX). I tried to make up my own regex character traits
> class, but this does not seem to help, because some other
> classes/types (such as sub_match) make use of basic_string<charT>.
>
> Is there any way to bypass the problem?

OK, all the following comments apply to 1.33.1.

There are two easy options and one harder option:

Easy option #1, use STLport if it supports wstring.
Easy option #2, use the ICU/Unicode support in 1.33.1 to search your data
directly (as long as it's in UTF-8, UTF-16 or UTF-32 format).  You'll get
back iterators into your data (whatever encoding it's in), so there are no
problems determining offsets etc.

The slightly harder option, as you've guessed already: write your own traits
class. From 1.33 onwards you can use vector<charT> in place of
basic_string<charT> in the traits class.  If you take a look at the traits
class used by the Unicode/ICU support code it should give you the general
idea, and there are docs here:
http://www.boost.org/libs/regex/doc/concepts.html#traits

And finally... if your data is in MBCS format you might get some ideas from
the Unicode support code in 1.33.x: basically, in order to handle multibyte
encodings it converts from UTF-8 or UTF-16 to UTF-32 code points on the fly.
Of course this requires that the on-the-fly conversions are bidirectional.
This works OK for Unicode, but I'm not sure how far you would get with
other encodings.

HTH, John.

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Re: [regex] Working with wchar_t on older UNIX platforms

John Maddock
> I am intrigued with what you said about converting data from UTF-8 to
> UTF-32 on the fly. It is absolutely not a problem to convert my
> Unicode strings to UTF-8 encoded strings. Where could I read about
> those on the fly conversions and what limitations do they have (e.g.
> how locale settings are handled)?

What locale settings?  UTF-8 is mostly locale-independent (as an encoding).
The only locale-specific code is in the traits class, to handle collation,
and it only sees UTF-32 code points.  The on-the-fly conversions are
performed by iterator adapters in boost/regex/pending/unicode_iterator.hpp
and the docs for the Unicode-aware code are here:
http://www.boost.org/libs/regex/doc/icu_strings.html

John.
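
As a rough illustration of what an on-the-fly UTF-8 to UTF-32 conversion
involves, here is a minimal sketch in plain C++ that uses neither Boost nor
ICU. The names code_point, decode_next, decode_prev and to_utf32 are made up
for this example and are not part of Boost.Regex, and the code assumes
well-formed UTF-8 (it does no validation). decode_prev shows the backward
step that a bidirectional conversion needs: skip back over continuation
bytes until a lead byte is found, then decode forwards from there.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names for illustration; none of these come from Boost.Regex.
typedef unsigned long code_point;   // wide enough for any UTF-32 value

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b)
{
    return (b & 0xC0u) == 0x80u;
}

// Decode the code point that starts at s[pos] and advance pos past it.
// Assumes well-formed UTF-8: no validation is done.
inline code_point decode_next(const std::string& s, std::size_t& pos)
{
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80u)
        return lead;                                  // single byte (ASCII)
    const int extra = (lead >= 0xF0u) ? 3 : (lead >= 0xE0u) ? 2 : 1;
    code_point cp = lead & (0x3Fu >> extra);          // payload bits of the lead byte
    for (int i = 0; i < extra; ++i)
    {
        assert(pos < s.size());
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3Fu);
    }
    return cp;
}

// Move pos back to the start of the previous code point and return it.
// pos is left pointing at that code point's lead byte.
inline code_point decode_prev(const std::string& s, std::size_t& pos)
{
    assert(pos > 0 && pos <= s.size());
    do
    {
        --pos;
    } while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])));
    std::size_t tmp = pos;                            // decode without moving pos
    return decode_next(s, tmp);
}

// Convert a whole UTF-8 string to a sequence of UTF-32 code points.
inline std::vector<code_point> to_utf32(const std::string& s)
{
    std::vector<code_point> out;
    std::size_t pos = 0;
    while (pos < s.size())
        out.push_back(decode_next(s, pos));
    return out;
}
```

Because pos is always an offset into the original UTF-8 data, positions found
while looking at UTF-32 values map straight back to the source bytes, which
is the property John describes for the iterators returned by the Unicode
support.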

Re: [regex] Working with wchar_t on older UNIX platforms

Andrei Tarassov
As I understand it, using this feature requires ICU. Unfortunately, this is
not an option for us :-(

Andrei

-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of John Maddock
Sent: Tuesday, March 21, 2006 20:54
To: [hidden email]
Subject: Re: [Boost-users] [regex] Working with wchar_t on older
UNIX platforms

> I am intrigued with what you said about converting data from UTF-8 to
> UTF-32 on the fly. It is absolutely not a problem to convert my
> Unicode strings to UTF-8 encoded strings. Where could I read about
> those on the fly conversions and what limitations do they have (e.g.
> how locale settings are handled)?

What locale settings?  UTF-8 is mostly locale-independent (as an encoding).
The only locale-specific code is in the traits class, to handle collation,
and it only sees UTF-32 code points.  The on-the-fly conversions are
performed by iterator adapters in boost/regex/pending/unicode_iterator.hpp
and the docs for the Unicode-aware code are here:
http://www.boost.org/libs/regex/doc/icu_strings.html

John.

Re: [regex] Working with wchar_t on older UNIX platforms

John Maddock
> As I understand using this feature requires ICU. Unfortunately, this
> is not an option for us :-(

Yes, sorry, understood.

You're probably back to writing a traits class then. It's honestly not that
hard :-)  I suggest you take c_regex_traits as a starting point, change
basic_string to vector and work from there.

John.
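
To make the suggestion a bit more concrete, here is a minimal, standalone
sketch of what such a traits class might look like with
string_type = std::vector<charT>. It is not c_regex_traits and has not been
checked against 1.33.1: the class name vector_regex_traits, the mask values
and the ASCII-only classification are all assumptions made for illustration,
and the member list follows the traits requirements as I understand them from
the concepts documentation, so verify it there before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: hypothetical class, ASCII-only behaviour, no real locale support.
template <class charT>
class vector_regex_traits
{
public:
    typedef charT               char_type;
    typedef std::size_t         size_type;
    typedef std::vector<charT>  string_type;     // used in place of basic_string<charT>
    typedef int                 locale_type;     // placeholder type for this sketch
    typedef unsigned            char_class_type; // bitmask of the mask_* values below

    enum { mask_digit = 1, mask_alpha = 2, mask_space = 4, mask_word = 8 };

    vector_regex_traits() : loc_(0) {}

    // Length of a null-terminated charT string.
    static size_type length(const charT* p)
    {
        size_type n = 0;
        while (p[n] != charT()) ++n;
        return n;
    }
    charT translate(charT c) const { return c; }
    charT translate_nocase(charT c) const          // ASCII-only case folding
    {
        return (c >= ch('A') && c <= ch('Z'))
            ? static_cast<charT>(c - ch('A') + ch('a')) : c;
    }
    template <class Iter>
    string_type transform(Iter first, Iter last) const   // sort key: raw code order
    {
        return string_type(first, last);
    }
    template <class Iter>
    string_type transform_primary(Iter first, Iter last) const  // case-folded key
    {
        string_type key;
        for (; first != last; ++first) key.push_back(translate_nocase(*first));
        return key;
    }
    template <class Iter>
    char_class_type lookup_classname(Iter first, Iter last) const
    {
        if (equals(first, last, "digit") || equals(first, last, "d")) return mask_digit;
        if (equals(first, last, "alpha")) return mask_alpha;
        if (equals(first, last, "space") || equals(first, last, "s")) return mask_space;
        if (equals(first, last, "w")) return mask_word;
        return 0;                                  // unknown class name
    }
    template <class Iter>
    string_type lookup_collatename(Iter, Iter) const
    {
        return string_type();                      // no named collating elements here
    }
    bool isctype(charT c, char_class_type m) const
    {
        const bool digit = c >= ch('0') && c <= ch('9');
        const bool alpha = (c >= ch('a') && c <= ch('z')) || (c >= ch('A') && c <= ch('Z'));
        const bool space = c == ch(' ') || (c >= ch('\t') && c <= ch('\r'));
        return ((m & mask_digit) && digit) || ((m & mask_alpha) && alpha)
            || ((m & mask_space) && space)
            || ((m & mask_word) && (digit || alpha || c == ch('_')));
    }
    // Numeric value of digit c in the given radix, or -1 if it is not a digit.
    int value(charT c, int radix) const
    {
        assert(radix >= 2 && radix <= 16);
        int v = -1;
        if (c >= ch('0') && c <= ch('9')) v = static_cast<int>(c - ch('0'));
        else if (c >= ch('a') && c <= ch('f')) v = 10 + static_cast<int>(c - ch('a'));
        else if (c >= ch('A') && c <= ch('F')) v = 10 + static_cast<int>(c - ch('A'));
        return (v >= 0 && v < radix) ? v : -1;
    }
    locale_type imbue(locale_type l) { locale_type old = loc_; loc_ = l; return old; }
    locale_type getloc() const { return loc_; }

private:
    static charT ch(char c) { return static_cast<charT>(c); }

    // Compare a charT range against a narrow ASCII name.
    template <class Iter>
    static bool equals(Iter first, Iter last, const char* name)
    {
        for (; first != last && *name; ++first, ++name)
            if (*first != ch(*name)) return false;
        return first == last && *name == '\0';
    }
    locale_type loc_;
};
```

The point of the sketch is that nothing in the class refers to
basic_string<charT>, so it does not depend on wstring being available; real
collation and character classification for a given platform would replace
the ASCII-only parts.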
