[regex] Working with wchar_t on older UNIX platforms

Andrei Tarassov
Hi!

I am having a problem integrating boost::regex into one of our software projects. The project supports a number of old operating systems (such as AIX 4.3.3 and HP-UX 11.00) that lack full support for the functions that work with wide characters (wchar_t). As a result, GCC on these platforms is built without wstring support.

On the other hand, we do use wchar_t to the extent these operating systems allow, and we find the limited support more or less sufficient.

I have now tried to integrate wregex into the software, but it will not compile: it complains about the missing wstring (and about BOOST_NO_WREGEX being defined). I tried writing my own regex character traits class, but that does not seem to help, because other classes/types (such as sub_match) also make use of basic_string<charT>.

Is there any way to bypass the problem?

I could use the plain-char version of regex, but then I have trouble determining the position of a match in the original wide-character string, because the conversion from wchar_t to char may involve a multibyte encoding.

Thanks,

--
ANDREI TARASSOV
Software Engineer III
Altiris OÜ
T >  +372 6507154
M >  +372 53403298
www.altiris.com

_______________________________________________
Boost-users mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Re: [regex] Working with wchar_t on older UNIX platforms

John Maddock
> I have now tried to integrate wregex into the software, but it will
> not compile: it complains about the missing wstring (and about
> BOOST_NO_WREGEX being defined). I tried writing my own regex
> character traits class, but that does not seem to help, because other
> classes/types (such as sub_match) also make use of basic_string<charT>.
>
> Is there any way to bypass the problem?

OK, all of the following comments apply to Boost 1.33.1.

There are two easy options and one harder option:

Easy option #1, use STLport if it supports wstring.
Easy option #2, use the ICU/Unicode support in 1.33.1 to search your data
directly (as long as it's in UTF-8, UTF-16 or UTF-32 format).  You'll get
back iterators into your data (whatever encoding it's in), so there are no
problems determining offsets etc.

The slightly harder option, as you've guessed already, is to write your own
traits class.  From 1.33 onwards you can use vector<charT> in place of
basic_string<charT> in the traits class.  If you take a look at the traits
class used by the Unicode/ICU support code it should give you the general
idea, and there are docs here:
http://www.boost.org/libs/regex/doc/concepts.html#traits
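
To give a feel for the traits-class route, here is a rough sketch of a
traits class whose string_type is std::vector<wchar_t> rather than wstring.
The member names follow my reading of the traits requirements in the docs
linked above; the class name vec_wtraits, the ASCII-only rules, the dummy
int locale and the cls_* masks are all made up for illustration.  It has not
been compiled against Boost 1.33.x, and the library may need further members
beyond these.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical traits sketch: ASCII-only rules, no use of std::wstring
// and no calls into the platform's wide-character C library.
struct vec_wtraits {
    typedef wchar_t              char_type;
    typedef std::size_t          size_type;
    typedef std::vector<wchar_t> string_type;     // instead of basic_string
    typedef int                  locale_type;     // placeholder, no real locale
    typedef unsigned             char_class_type; // bitmask of cls_* values

    enum { cls_digit = 1, cls_alpha = 2, cls_space = 4 };

    vec_wtraits() : loc_(0) {}

    static size_type length(const char_type* p) {
        size_type n = 0;
        while (p[n] != 0) ++n;
        return n;
    }
    char_type translate(char_type c) const { return c; }
    char_type translate_nocase(char_type c) const {
        return (c >= L'A' && c <= L'Z')
            ? static_cast<char_type>(c - L'A' + L'a') : c;
    }
    // Sort key: here simply a copy of the input range.
    string_type transform(const char_type* f, const char_type* l) const {
        return string_type(f, l);
    }
    // Primary sort key: here the case-folded input.
    string_type transform_primary(const char_type* f, const char_type* l) const {
        string_type key(f, l);
        for (size_type i = 0; i < key.size(); ++i)
            key[i] = translate_nocase(key[i]);
        return key;
    }
    // Map a class name such as "digit" to a mask; 0 means "unknown".
    char_class_type lookup_classname(const char_type* f, const char_type* l) const {
        if (equals(f, l, L"digit")) return cls_digit;
        if (equals(f, l, L"alpha")) return cls_alpha;
        if (equals(f, l, L"space")) return cls_space;
        return 0;
    }
    // Only single characters count as collating elements in this sketch.
    string_type lookup_collatename(const char_type* f, const char_type* l) const {
        return (l - f == 1) ? string_type(f, l) : string_type();
    }
    bool isctype(char_type c, char_class_type m) const {
        if ((m & cls_digit) && c >= L'0' && c <= L'9') return true;
        if ((m & cls_alpha) && ((c >= L'a' && c <= L'z') ||
                                (c >= L'A' && c <= L'Z'))) return true;
        if ((m & cls_space) && (c == L' ' || (c >= L'\t' && c <= L'\r')))
            return true;
        return false;
    }
    // Numeric value of digit c in the given radix, or -1 if it is not one.
    int value(char_type c, int radix) const {
        assert(radix == 8 || radix == 10 || radix == 16);
        int v = -1;
        if (c >= L'0' && c <= L'9') v = c - L'0';
        else if (c >= L'a' && c <= L'f') v = c - L'a' + 10;
        else if (c >= L'A' && c <= L'F') v = c - L'A' + 10;
        return (v >= 0 && v < radix) ? v : -1;
    }
    locale_type imbue(locale_type l) { locale_type old = loc_; loc_ = l; return old; }
    locale_type getloc() const { return loc_; }

private:
    // Compare the range [f, l) with a null-terminated name.
    static bool equals(const char_type* f, const char_type* l, const char_type* name) {
        for (; f != l && *name != 0; ++f, ++name)
            if (*f != *name) return false;
        return f == l && *name == 0;
    }
    locale_type loc_;
};
```

The intent would be to plug this in as the traits argument of basic_regex
for wchar_t, so that nothing in the traits layer mentions wstring.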

And finally... if your data is in MBCS format you might get some ideas from
the Unicode support code in 1.33.x: basically, in order to handle multibyte
encodings, it converts from UTF-8 or UTF-16 to UTF-32 code points on the fly.
Of course this requires that the on-the-fly conversions are bidirectional.
That works OK for Unicode, but I'm not sure how far you would get with
other encodings.
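
To make the on-the-fly idea concrete, here is a minimal sketch of decoding
UTF-8 into UTF-32 code points while remembering where each one started in
the original bytes.  This is not the Boost code: the names DecodedChar,
decode_one and decode_all are made up for illustration, the input is assumed
to be well-formed UTF-8, and a real implementation would wrap the decode
step in an iterator adaptor instead of filling a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One decoded code point and the byte offset of its lead byte in the input.
struct DecodedChar {
    unsigned long code_point;
    std::size_t   byte_offset;
};

// Decode the UTF-8 sequence starting at s[pos] and advance pos past it.
// Assumes well-formed UTF-8; a real decoder must validate its input.
unsigned long decode_one(const std::string& s, std::size_t& pos) {
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos++]);
    if (lead < 0x80) return lead;                  // single byte (ASCII)
    const int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    unsigned long cp = lead & (0x3F >> extra);     // payload bits of lead byte
    for (int i = 0; i < extra; ++i) {
        assert(pos < s.size());
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
    }
    return cp;
}

// Decode the whole buffer, keeping the original offset of every code point.
std::vector<DecodedChar> decode_all(const std::string& s) {
    std::vector<DecodedChar> out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        DecodedChar d;
        d.byte_offset = pos;
        d.code_point = decode_one(s, pos);
        out.push_back(d);
    }
    return out;
}
```

Because every code point carries its byte offset, a match found in the
decoded sequence maps straight back to a position in the original data.
Stepping backwards works for UTF-8 because continuation bytes always have
the form 10xxxxxx, so the previous lead byte can be found by skipping them;
an encoding without that property is where the bidirectional requirement
gets awkward.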

HTH, John.
