[Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2
I came across an old thread (http://sourceforge.net/mailarchive/forum.php?forum_name=spirit-general&max_rows=25&offset=7&style=nested&viewmonth=200811&viewday=25) between OvermindDL1, Joel, and Hartmut about Unicode/UTF-8 support. That was a year or two ago, in 2008.
 
I don't see many examples in the PDF documentation of using UTF-8 encoding with Qi; apologies if there is already a thread on this.
 
I believe Qi does not enable UTF-8 support by default, and one encounters the assertion below.
 
Assertion failed: isascii_(ch), file c:\src\is2\public\ext\boost_1_41\boost\spirit\home\support\char_encoding\ascii.hpp, line 256
 
I see that there is a utf8.hpp file in support/char_encoding. I am hoping that UTF-8 support is already in Spirit.Qi.
 
Thanks
 
Henry


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/28/2010 6:37 AM, Henry Tan wrote:

> I believe Qi does not enable UTF-8 support by default, and one encounters
> the assertion below. [...]
> I see that there is a utf8.hpp file in support/char_encoding. I am hoping
> that UTF-8 support is already in Spirit.Qi.

I'm working on it. We might have the char classifiers soon. I'll also revamp
the whole spirit encoding dance to make it easier to use and fix the typical
usability problems one hits when switching between encodings.

For now, you can use the pending Boost.Regex unicode iterators to expose your
UTF-8 stream into ::boost::uint32_t (see boost/regex/pending/unicode_iterator.hpp).
You won't yet have the char-classifiers, but you can use the char-set and range
parsers to get what you need and all other parsers are usable (except the
nocase directive).

There's so much to do... we always welcome help.

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net
http://www.facebook.com/djowel

Meet me at BoostCon
http://www.boostcon.com/home
http://www.facebook.com/boostcon





Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Wed, Jan 27, 2010 at 5:02 PM, Joel de Guzman <[hidden email]> wrote:
> I'm working on it. We might have the char classifiers soon. [...]
> For now, you can use the pending Boost.Regex unicode iterators to expose
> your UTF-8 stream into ::boost::uint32_t. [...]
> There's so much to do... we always welcome help.

Hi Joel:
 
Thanks for letting me know about the work in progress.

In my case, I do not need to detect Unicode characters; I just need the parser not to crash on non-ASCII input when someone enters Chinese/Japanese characters. Basically, I just need Spirit to accept an unsigned char* array (8 bits per element).

What would be the short-term workaround?


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/28/2010 9:47 AM, Henry Tan wrote:
> In my case, I do not need to detect Unicode characters; I just need the
> parser not to crash on non-ASCII input when someone enters Chinese/Japanese
> characters. [...]
> What would be the short-term workaround?

Well, don't use ascii. Use standard or standard_wide.
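
To illustrate (a sketch; the two helper functions are made up for this example): the character parsers live in encoding-specific namespaces, so switching encodings is a matter of switching namespaces.

#include <boost/spirit/include/qi.hpp>
#include <string>

namespace qi = boost::spirit::qi;

// ascii::alnum asserts on bytes outside 0..127
bool parse_ascii(std::string const& in)
{
    std::string::const_iterator f = in.begin(), l = in.end();
    return qi::phrase_parse(f, l, +qi::ascii::alnum, qi::ascii::space);
}

// standard_wide::alnum accepts the full wchar_t range
bool parse_wide(std::wstring const& in)
{
    std::wstring::const_iterator f = in.begin(), l = in.end();
    return qi::phrase_parse(f, l, +qi::standard_wide::alnum,
                            qi::standard_wide::space);
}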

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Wed, Jan 27, 2010 at 6:18 PM, Joel de Guzman <[hidden email]> wrote:
> Well, don't use ascii. Use standard or standard_wide.
 
Using standard does not work; standard_wide works fine. Using standard causes an assertion in isctype.c. I traced the problem and found that standard.hpp's isspace() takes an int, and isspace does not like negative values. In this case, the value that triggered the crash was 239 (-17 as a signed char). Would it be better to static_cast to unsigned char there, perhaps?

There is some info below on why isspace misbehaves when given a plain int:
http://www.greenend.org.uk/rjk/2001/02/cfu.html
 
A bug?
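
For reference, a minimal sketch of the <cctype> pitfall being described (the byte value is from the report above; the program itself is illustrative): passing a char with the high bit set straight to isspace is undefined behaviour, because the argument must be representable as unsigned char or equal EOF.

#include <cctype>
#include <cstdio>

int main()
{
    char byte = '\xEF'; // 239 as unsigned char, -17 as signed char

    // Undefined behaviour; asserts in isctype.c on MSVC debug CRTs:
    // std::isspace(byte);

    // Well-defined: widen through unsigned char first.
    int result = std::isspace(static_cast<unsigned char>(byte));
    std::printf("isspace(0xEF) = %d\n", result);
    return 0;
}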


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

OvermindDL1
On Wed, Jan 27, 2010 at 8:45 PM, Henry Tan <[hidden email]> wrote:

> Using standard does not work; standard_wide works fine. Using standard
> causes an assertion in isctype.c. [...] Would it be better to static_cast
> to unsigned char there, perhaps?
>
> There is some info below on why isspace misbehaves with a plain int:
> http://www.greenend.org.uk/rjk/2001/02/cfu.html
>
> A bug?

I can confirm this. We need to cast the int to unsigned char before
passing it to functions like std::isspace and the others.


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/28/2010 11:45 AM, Henry Tan wrote:
> Using standard does not work; standard_wide works fine. [...] Would it be
> better to static_cast to unsigned char there, perhaps?
> There is some info below on why isspace misbehaves with a plain int:
> http://www.greenend.org.uk/rjk/2001/02/cfu.html
> A bug?

It's not a bug. The assert is in there to protect you.

Yes, standard expects char, not unsigned char. I'm confused. As you
say: "I do not need to detect unicode code characters", so why is
your subject "How do I use UTF-8 encoding with Qi"? What encoding
are you using? Those assertions are in place to make sure you are
using the right encoding; in your case, you are not.

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Wed, Jan 27, 2010 at 8:15 PM, Joel de Guzman <[hidden email]> wrote:
> It's not a bug. The assert is in there to protect you.
>
> Yes, standard expects char, not unsigned char. I'm confused. As you say:
> "I do not need to detect unicode code characters", so why is your subject
> "How do I use UTF-8 encoding with Qi"? What encoding are you using?
 
The string I am passing is a UTF-8 stream. Maybe I was not clear about my intention. When I said "I do not need to detect unicode code characters", I meant that I have a rule that only matches an ASCII string, for example:
 
MyRule = qi::string("foo:");
 
Somebody, however, can feed a UTF-8 stream into my program, e.g. 陳瑞名, and I don't want it to crash my program. In other words, if Qi can accept unsigned char, it won't crash my program. Unfortunately, using standard.hpp as you suggested still crashes my program because, by design, it only works for char (ASCII).
 


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/28/2010 12:31 PM, Henry Tan wrote:

> The string I am passing is a UTF-8 stream. [...] I have a rule that only
> matches an ASCII string, for example:
> MyRule = qi::string("foo:");
> Somebody, however, can feed a UTF-8 stream into my program, e.g. 陳瑞名,
> and I don't want it to crash my program. In other words, if Qi can accept
> unsigned char, it won't crash my program.

If you are indeed using UTF-8 stream, then it is wrong to use
standard or even standard_wide character classification parsers
like space, alpha, etc. First, UTF-8 is 8 bits, so it can't work with
standard which expects char as the underlying data type. If you use
standard_wide, then it is still wrong because a code-point in UTF-8
may span 1 to 4 bytes. While it may not crash, it will give the wrong
results. Your best bet (before spirit officially supports unicode) is
to use the Boost Regex unicode iterators to convert your UTF-8 into
::boost::uint32_t (see boost/regex/pending/unicode_iterator.hpp)
and use spirit::standard_wide over the unicode code-points generated.
You have to make sure that your platform's standard_wide fully supports
unicode code points though. Some don't. See this:

     http://tinyurl.com/yexylza

for more info why.
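
A trivial way to check that caveat on a given platform (illustrative sketch):

#include <iostream>

int main()
{
    // 16 bits on Windows, 32 bits on most Unix-like systems; only the
    // latter can hold every Unicode code point in a single wchar_t.
    std::cout << "wchar_t is " << sizeof(wchar_t) * 8 << " bits wide\n";
    return 0;
}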

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/28/2010 1:08 PM, Joel de Guzman wrote:

> Your best bet (before spirit officially supports unicode) is to use the
> Boost Regex unicode iterators to convert your UTF-8 into ::boost::uint32_t
> and use spirit::standard_wide over the unicode code-points generated.

Oh BTW, when we do have unicode support, all you will have to do is
switch from spirit::standard_wide to spirit::unicode. Also, we'll
adopt the Boost Regex unicode iterators as a standard for Spirit.

Ok, tell you what: I'll check in my latest work on unicode this week. It
has all the Unicode-specific character classes (see the General_Category
Values section here: http://tinyurl.com/yads436). The first shot will not
have upper/lower conversions yet, but all the general categories will be
supported.

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Wed, Jan 27, 2010 at 9:21 PM, Joel de Guzman <[hidden email]> wrote:
> Oh BTW, when we do have unicode support, all you will have to do is
> switch from spirit::standard_wide to spirit::unicode. [...]
> Ok, tell you what: I'll check in my latest work on unicode this week. It
> has all the Unicode-specific character classes. The first shot will not
> have upper/lower conversions yet, but all the general categories will be
> supported.

Wow, what a treat from you, Joel!

Regarding standard_wide: how do you think it would impact performance and memory? How different would that be from using ascii/unsigned char?

Regarding your point below about the UTF-8 stream:

"... If you are indeed using UTF-8 stream, then it is wrong to use
standard or even standard_wide character classification parsers
like space, alpha, etc. First, UTF-8 is 8 bits, so it can't work with
standard which expects char as the underlying data type. ..."

Yes, I agree it is wrong to use standard/standard_wide in their current state, where int is the storage type; using unsigned char would violate the intent of standard/standard_wide. My requirement, however, is to take an unsigned char stream, and perhaps I don't need to know whether it is UTF-8 encoded or not. I would be perfectly fine if values between 0 and 255 were accepted. If that condition is met, and assuming an external module checks the validity of the UTF-8 encoding, then Qi will work against a UTF-8 stream easily.

As a matter of fact, I just added a variant of standard.hpp, which I named uint8.hpp. The change was to static_cast the int to unsigned char; using unsigned char* as the iterator, it works perfectly fine for me!

In any case, I will wait for your code check-in early next week and try it out! If performance is not a drag, I will definitely go for your new code.

Thanks!

HT


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/28/2010 7:57 PM, Henry Tan wrote:

> As a matter of fact, I just added a variant of standard.hpp, which I named
> uint8.hpp. The change was to static_cast the int to unsigned char; using
> unsigned char* as the iterator, it works perfectly fine for me!

That does not make sense. You will get wrong results. For example,
consider the unicode code point for the Euro sign:

     U+20AC

The UTF-8 encoding for that is

     0xE2,0x82,0xAC

Note that that is 3 bytes! Now, if you check 0xE2 against isalpha,
for example, you may get a true result (0xE2 is â in Latin-1), but the Euro sign
IS NOT(!) an alphabetic character. You will have a better chance with
standard_wide if your underlying OS supports unicode for standard_wide.
How different will that be from using ascii/unsigned char? It will be slower,
of course, but it is the right thing to do.

Now, if you don't know whether the underlying encoding is UTF-8 or
not, then you are out of luck. You can't, I repeat, you can't(!) use
the char classification parsers on it. The meaning of isalpha is
encoding dependent.
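
To make the byte-versus-code-point distinction concrete, here is a small illustrative sketch that decodes the Euro sign's three UTF-8 bytes into the single code point U+20AC using the Boost.Regex iterator mentioned earlier:

#include <boost/regex/pending/unicode_iterator.hpp>
#include <cstdio>
#include <string>

int main()
{
    std::string euro = "\xE2\x82\xAC"; // U+20AC encoded as three UTF-8 bytes

    typedef boost::u8_to_u32_iterator<std::string::const_iterator> u32_iterator;
    u32_iterator it(euro.begin()), end(euro.end());

    // Prints "U+20AC" exactly once: three bytes, one code point.
    for (; it != end; ++it)
        std::printf("U+%04X\n", static_cast<unsigned>(*it));
    return 0;
}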

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Thu, Jan 28, 2010 at 5:11 AM, Joel de Guzman <[hidden email]> wrote:
> That does not make sense. You will get wrong results. For example,
> consider the unicode code point for the Euro sign: U+20AC. The UTF-8
> encoding for that is 0xE2,0x82,0xAC. Note that that is 3 bytes! [...]
> Now, if you don't know whether the underlying encoding is UTF-8 or not,
> then you are out of luck. You can't, I repeat, you can't(!) use the char
> classification parsers on it. The meaning of isalpha is encoding dependent.

 
It only fails to make sense if 'char' is the magic word. My assumption was that I am working with unsigned char, not char. Maybe the current implementation is not really for unsigned char, but what I am saying is: what if I defined my own version of isalpha/isspace/isxxx instead of forcing everything through char?

As a matter of fact, you can run the following program and see that 0xE2 is not an alpha/space/lower, etc. Forget for a second whether this is the right way of doing things. But if (hypothetically) your isalpha is designed to work on 8-bit values, I don't see why it would classify a value outside the [65-90] and [97-122] ranges as an alpha. 0xE2 is 226, which is outside those ranges.

I don't entirely dispute the more proper way of doing this, i.e. using standard_wide. But perhaps there can be a more efficient solution that handles UTF-8 by working on 8-bit data rather than wide data, because UTF-8 is 8-bit and it is different from unicode, for which you would surely need wide characters.
 
#include <ctype.h>
#include <stdio.h>

typedef int (*charclassifier)(int);

void classify(charclassifier f);

int main()
{
    classify(isalpha);
    classify(isspace);
    classify(islower);
    return 0;
}

void classify(charclassifier f)
{
    /* walk the full unsigned char range 0-255 */
    int i;
    for (i = 0; i < 256; ++i)
    {
        if (f((unsigned char)i))
            printf("%c (%d) = true\n", i, i);
        else
            printf("%c (%d) = false\n", i, i);
    }
}


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/29/2010 4:43 AM, Henry Tan wrote:

>
>
> On Thu, Jan 28, 2010 at 5:11 AM, Joel de Guzman
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     On 1/28/2010 7:57 PM, Henry Tan wrote:
>
>      > Regarding using standard_wide, how do you think that could impact the
>      > perf / memory. How differ will that be from using ascii/unsigned
>     char?
>      >
>      > Regarding your point below about UTF-8 stream
>      >
>      > "... If you are indeed using UTF-8 stream, then it is wrong to use
>      > standard or even standard_wide character classification parsers
>      > like space, alpha, etc. First, UTF-8 is 8 bits, so it can't work with
>      > standard which expects char as the underlying data type.  .... "
>      >
>      > Yes I agree it is wrong to use standard/standard_wide per current
>     state
>      > where it uses int as the storage type, so using unsigned char will be
>      > violating the intention of declaring standard/standard_wide. My
>      > requirement is however, I want to take unsigned char stream and
>     perhaps
>      > I don't need to know if whether it is a UTF-8 encoding or not. I
>     will be
>      > perfectly fine if value between 0-255 can be accepted. If such
>     condition
>      > is met, and assuming that there is an external module that checks for
>      > validity of UTF-8 encoding, then Qi will work against UTF-8
>     stream easily.
>      >
>      > As a matter of fact, I just added a variant of standard.hpp and I
>     named
>      > it as uint8.hpp. The change was to static cast int to unsigned
>     char and
>      > by using the unsigned char* as iterator, it works perfectly fine
>     for me!
>
>     That does not make sense. You will get wrong results. For example,
>     consider the unicode code point for the Euro sign:
>
>          U+20AC
>
>     The UTF-8 encoding for that is
>
>          0xE2,0x82,0xAC
>
>     Take note that that is 3 bytes! Now, if you check 0xE2 against isalpha,
>     for example, you may get a true result (0xE2 is â), but the Euro sign
>     IS NOT(!) an alphabetic character. You will have a better chance with
>     standard_wide if your underlying OS supports unicode for standard_wide.
>     How different will that be from using ascii/unsigned char? It will
>     be slower,
>     of course, but it is the right thing to do.
>
>     Now, if you don't know the underlying encoding whether it is UTF-8 or
>     not, then you are out of luck. You can't, I repeat, you can't(!) use
>     the char classification parsers for them. The meaning of isalpha is
>     encoding dependent.
>
>
> It does not make sense if you are using 'char' as the magic word. In my
> assumption I was working on unsigned char not char. Maybe the current
> implementation is not really for unsigned char but what I am saying is
> what if I defined my own version of isalpha/isspace/isxxx instead of
> forcing against the char?
> As a matter of fact, you can run the following program and you will see
> that 0xE2 is not an alpha/space/lower, etc ...Forget for a second if
> whether this is the right way of doing this or not. But say if
> (hypothetically) your isalpha is designed to work against 8-bit, I don't
> see why it would classify a value outside of [65-90], [95-122] range as
> an alpha. 0xE2 is 226 and it is outside of [65-90],[95-122] range.
> I don't totally dispute you about the right more proper way of doing
> this, i.e. to use standard_wide. But perhaps, there can be a more
> efficient solution to handle UTF8 by working on 8-bit data rather than
> on wide data, because UTF-8 is 8-bit and it is different than unicode
> which you would need wide for sure.

UTF-8 *IS* unicode (8-bit UCS/Unicode Transformation Format). UTF-8
encodes each character (code point) in 1 to 4 octets (8-bit bytes),
with the single octet encoding used only for the 128 US-ASCII
characters.

Henry, you are missing the whole point. UTF-8 is unsigned char, all right,
but that's beside the point. The point is that you can't use UTF-8
bytes as-is, because you need 1, 2, 3, or 4 of them to form a
code point. You first need to convert the UTF-8 stream into a
unicode (uint32) stream. Any, I repeat *any*, char classification scheme,
regardless of whether it works on char or unsigned char, will not work on
UTF-8 bytes (or any variable-length encoding, for that matter, e.g. Shift-JIS).

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
On 1/29/2010 6:58 AM, Joel de Guzman wrote:

> UTF-8 *IS* unicode (8-bit UCS/Unicode Transformation Format). [...]
> Any, I repeat *any*, char classification scheme, regardless of whether it
> works on char or unsigned char, will not work on UTF-8 bytes (or any
> variable-length encoding, for that matter, e.g. Shift-JIS).

Ok, having said that, here's what you can do if you want to use the
ASCII subset of UTF-8 with the spirit char classifiers: copy the ascii
encoding code and simply change this:

     BOOST_ASSERT(isascii_(ch));

to this:

     if (!isascii_(ch))
         return false;

and for the tolower/toupper, change it to:

     if (!isascii_(ch))
         return ch;

You can probably call it non_strict_ascii.
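
Putting those edits together, a simplified sketch of such an encoding (illustrative only; the real ascii.hpp has more classifiers and detail):

// Sketch of a "non_strict_ascii" char encoding: non-ASCII input simply
// fails classification instead of tripping an assert.
struct non_strict_ascii
{
    typedef char char_type;

    // true for code points 0..127
    static bool isascii_(int ch)
    {
        return 0 == (ch & ~0x7f);
    }

    static bool isalpha(int ch)
    {
        if (!isascii_(ch))       // was: BOOST_ASSERT(isascii_(ch));
            return false;
        return ('A' <= ch && ch <= 'Z') || ('a' <= ch && ch <= 'z');
    }

    static int tolower(int ch)
    {
        if (!isascii_(ch))       // was: BOOST_ASSERT(isascii_(ch));
            return ch;
        return ('A' <= ch && ch <= 'Z') ? ch - 'A' + 'a' : ch;
    }

    // ... the remaining classifiers follow the same pattern
};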

Regards,
--
Joel de Guzman

Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Thu, Jan 28, 2010 at 3:23 PM, Joel de Guzman <[hidden email]> wrote:
> Ok, having said that, here's what you can do if you want to use the
> ASCII subset of UTF-8 with the spirit char classifiers: copy the ascii
> encoding code and simply change this:
>
>     BOOST_ASSERT(isascii_(ch));
>
> to this:
>
>     if (!isascii_(ch))
>         return false;
>
> and for the tolower/toupper, change it to:
>
>     if (!isascii_(ch))
>         return ch;
>
> You can probably call it non_strict_ascii.

 
There you go; I am happy that my scenario is finally being understood :)

Basically, what I did was static_cast the int ch passed around in standard.hpp to unsigned char. In fact, Joel, I have a point here: you are passing an int, all right? What if the int value is 650? That is not within char range, yet the code contract allows it, since the parameter is an int. In fact, the assertion in isalpha() says: assert (c + 1) <= 256. So in a way, ctype.h is really saying that the argument should be unsigned char instead of int? At least it accepts 0-255, which is the unsigned char range.

For my scenario, just doing that fixes a big thing. Note that my scenario is not really about processing a grammar in UTF-8. My grammar is pure ASCII, but I want to accept the 128-255 values and take them as they are: don't change the stream, just pass it through Qi. I do not use Qi as the UTF-8 processor; I just need Qi to preserve the byte ordering, be it ASCII or chars in the 128-255 range. Someone else does the precondition check and the post-processing of the UTF-8.

In fact, just casting the int to unsigned char in standard.hpp works perfectly well for me!

As an example:

input: 'j' 'o' 'e' 'l' ':' '226' '129' '131'

grammar => Root %= qi::string("joel:") >> (+alnum | +char_((unsigned char)128, (unsigned char)255));

Qi will accept the input and preserve the byte ordering. So, be it UTF-8 or not, the goal is for me to match the grammar correctly and preserve the byte ordering.

When I evaluate the string returned by Root in my parse tree node, the above example gives me: "joel:" + (unsigned char)226 + (unsigned char)129 + (unsigned char)131.

See, I am not trying to suggest that this is how you process a grammar in UTF-8; I am suggesting a way to let a UTF-8 stream pass through Qi so that your ASCII grammar still works fine!


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2


On Thu, Jan 28, 2010 at 3:50 PM, Henry Tan <[hidden email]> wrote:


On Thu, Jan 28, 2010 at 3:23 PM, Joel de Guzman <[hidden email]> wrote:
On 1/29/2010 6:58 AM, Joel de Guzman wrote:
> On 1/29/2010 4:43 AM, Henry Tan wrote:
>> It does not make sense if you are using 'char' as the magic word. In my
>> assumption I was working on unsigned char not char. Maybe the current
>> implementation is not really for unsigned char but what I am saying is
>> what if I defined my own version of isalpha/isspace/isxxx instead of
>> forcing against the char?
>> As a matter of fact, you can run the following program and you will see
>> that 0xE2 is not an alpha/space/lower, etc ...Forget for a second if
>> whether this is the right way of doing this or not. But say if
>> (hypothetically) your isalpha is designed to work against 8-bit, I don't
>> see why it would classify a value outside of [65-90], [95-122] range as
>> an alpha. 0xE2 is 226 and it is outside of [65-90],[95-122] range.
>> I don't totally dispute you about the right more proper way of doing
>> this, i.e. to use standard_wide. But perhaps, there can be a more
>> efficient solution to handle UTF8 by working on 8-bit data rather than
>> on wide data, because UTF-8 is 8-bit and it is different than unicode
>> which you would need wide for sure.
>
> UTF-8 *IS* unicode (8-bit UCS/Unicode Transformation Format). UTF-8
> encodes each character (code point) in 1 to 4 octets (8-bit bytes),
> with the single octet encoding used only for the 128 US-ASCII
> characters.
>
> Henry, you are missing the whole point. UTF-8 is unsigned char alright,
> but the that's besides the point. The point is that you can't use UTF-8
> characters as-is because you need 1, 2, 3 or 4 UTF-8 characters to
> form a code-point. You first need to convert the UTF-8 stream into a
> unicode (uint32) stream. Any, I repeat *any*, char classification scheme,
> regardless if it works on char or unsigned char, will not work on UTF-8
> chars (or any variable length encoding, for that matter [e.g Shift-JIS]).

Ok, having said that, here's what you can do if you want to use the
ASCII subset of UTF-8 on the spirit char classifiers: copy the ascii
encoding code and simply change this:

    BOOST_ASSERT(isascii_(ch));

to this:

    if (!isascii_(ch))
        return false;

and for the tolower/toupper, change it to:

    if (!isascii_(ch))
        return ch;

You can probably call it non_strict_ascii.

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net
http://www.facebook.com/djowel

Meet me at BoostCon
http://www.boostcon.com/home
http://www.facebook.com/boostcon




------------------------------------------------------------------------------
The Planet: dedicated and managed hosting, cloud storage, colocation
Stay online with enterprise data centers and the best network in the business
Choose flexible plans and management services without long-term contracts
Personal 24x7 support from experience hosting pros just a phone call away.
http://p.sf.net/sfu/theplanet-com
_______________________________________________
Spirit-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/spirit-general

 
There you go, I am happy at least I am (my scenario) being finally understood :)
 
What I did basically, I const_cast the int ch you pass in standard.hpp to unsigned char. In fact, Joel I have a point here you are passing an int alright? what if the int value is 650 ? that is not within a char  range, yet I think the code contracts allows it as it passes an int. In fact the assertion in the isalpha() was saying assert: (c + 1) <= 256. So in a way ctype.h was really saying that it should be unsigned char instead of int ? At least it accept the 0-255 which is the unsigned char.
 
For, my scenario, really just doing that fixes a big thing. Note that my scenario is not really about processing grammar in UTF-8. My grammar is pure ASCII but I want to accept the 128-255 value and I want to just take it as it is don't change the stream and just make a pass through of Qi. I do not use Qi as the UTF8 processor, I just need Qi to preserve the byte ordering; be it ASCII or char(128-255). Someone else doing the precondition check and post processing on UTF-8.
 
In fact, just casting the int to unsigned char in standard.hpp works perfectly well for me!
 
As an example:
 
input: 'j' 'o' 'e' 'l' ':' followed by the bytes 226 129 131
 
grammar => Root %= qi::string("joel:") >> (+alnum | +char_((unsigned char)128, (unsigned char)255));
 
This will accept the input and preserve the byte ordering. So, be it UTF-8 or not, the goal for me is to match the grammar correctly and preserve the byte ordering.
 
When I evaluate the string returned by Root in my parse-tree node, the above example gives me: "joel:" + (unsigned char)226 + (unsigned char)129 + (unsigned char)131.
 
See, I am not trying to suggest that this is how you process a grammar in UTF-8; I am suggesting a way to let a UTF-8 stream pass through Qi so that a grammar written in ASCII still works fine!
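
Here is a self-contained sketch of that pass-through idea (my
reconstruction, not code from this thread; note that the byte-range
alternative is tried first, so the stock char classifiers are never
asked about a non-ASCII octet and the unsigned-char cast described
above is not even needed):

    // Pass-through grammar: match "joel:", then either raw high bytes or
    // ASCII alphanumerics, preserving all bytes in the synthesized string.
    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;

    int main()
    {
        std::string input = "joel:";
        input += char(226); input += char(129); input += char(131);

        qi::rule<std::string::iterator, std::string()> root;
        root %= qi::string("joel:")
             >> (+qi::char_('\x80', '\xFF') | +qi::alnum);

        std::string result;
        std::string::iterator first = input.begin();
        bool ok = qi::parse(first, input.end(), root, result);

        // result now holds "joel:" plus the three original bytes,
        // in the original order
        std::cout << std::boolalpha << ok << ' ' << result.size() << '\n';
        return ok ? 0 : 1;
    }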

 
Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Henry Tan-2

BTW, I thought the cleaner way to add my custom char_encoding is to add a new file instead of changing an existing one. If, let's say, I want to add a new category, say uchar8.hpp, what is the list of changes I need to make? For example, the files containing the standard_wide keyword are the following. Is it as easy as just following the trail of the standard_wide occurrences and making similar declarations, etc.?
 
home\karma\char\char.hpp:      : detail::basic_literal<Modifiers, char_encoding::standard_wide> {};
home\karma\char\char.hpp:      : detail::basic_literal<Modifiers, char_encoding::standard_wide> {};
home\karma\char\char_class.hpp:    namespace standard_wide { using namespace boost::spirit::standard_wide; }
home\lex\lexer\char_token_def.hpp:      : detail::basic_literal<char_encoding::standard_wide> {};
home\lex\lexer\char_token_def.hpp:      : detail::basic_literal<char_encoding::standard_wide> {};
home\qi\char\char.hpp:      : detail::basic_literal<Modifiers, char_encoding::standard_wide> {};
home\qi\char\char.hpp:      : detail::basic_literal<Modifiers, char_encoding::standard_wide> {};
home\qi\char\char_class.hpp:    namespace standard_wide { using namespace boost::spirit::standard_wide; }
home\support\char_encoding\standard_wide.hpp:#if !defined(BOOST_SPIRIT_STANDARD_WIDE_NOVEMBER_10_2006_0913AM)
home\support\char_encoding\standard_wide.hpp:#define BOOST_SPIRIT_STANDARD_WIDE_NOVEMBER_10_2006_0913AM
home\support\char_encoding\standard_wide.hpp:    struct standard_wide
home\support\common_terminals.hpp:#include <boost/spirit/home/support/char_encoding/standard_wide.hpp>
home\support\common_terminals.hpp:// each for ascii, iso8859_1, standard and standard_wide. These placeholders
home\support\common_terminals.hpp:BOOST_SPIRIT_DEFINE_CHAR_CODES(standard_wide)
include\support_standard_wide.hpp:#ifndef BOOST_SPIRIT_INCLUDE_SUPPORT_STANDARD_WIDE
include\support_standard_wide.hpp:#define BOOST_SPIRIT_INCLUDE_SUPPORT_STANDARD_WIDE
include\support_standard_wide.hpp:#include <boost/spirit/home/support/char_encoding/standard_wide.hpp>


Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
In reply to this post by Henry Tan-2
On 1/29/2010 7:50 AM, Henry Tan wrote:

> There you go; I'm happy that at least my scenario is finally being
> understood :)
> What I did, basically, was cast the int ch that you pass in standard.hpp
> to unsigned char. In fact, Joel, I have a point here: you are passing an
> int, right? What if the int value is 650?

Well, that is not possible. See char_class.hpp. For example:

     typedef typename CharEncoding::char_type char_type;

     template <typename Char>
     static bool
     is(tag::char_, Char ch)
     {
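         // Note: ch is narrowed to the encoding's char_type before any
         // classification, so an out-of-range int never reaches it.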
         return CharEncoding::ischar(char_type(ch));
     }

> That is not within char
> range, yet I think the code's contract allows it, since it takes an int. In

The API you are looking at is not public. You shouldn't be deducing
any "contract" from it.

> fact, the assertion in isalpha() was saying assert: (c + 1) <= 256.
> So in a way ctype.h was really saying that it should be *unsigned char*
> instead of *int*?

What you are looking at is implementation-defined. You should not
rely on it. Some platforms may very well require only ASCII,
some may allow iso8859 and its variants, etc.

> At least it accepts 0-255, which is the unsigned char range.

Sure, but the characters in the extended character set (above 0x7F)
may fall into different categories depending on the locale and the
platform. So you can't really predict whether a given character is an
alpha, for example. For parsers, this is not good: on one platform
your grammar works, on another it fails. A stable grammar is
important: one that works reliably, independent of the platform.
That is why I don't much like adding "locales" to Spirit.

> For my scenario, really, just doing that fixes a big thing. Note that my
> scenario is not really about processing a grammar in UTF-8. My grammar
> is pure ASCII, but I want to accept the 128-255 values and take them as
> they are: don't change the stream, just make a pass through Qi. I do not
> use Qi as the UTF-8 processor; I just need Qi to preserve the byte
> ordering, be it ASCII or char(128-255). Someone else does the
> precondition check and post-processing on the UTF-8.
> In fact, just casting the int to unsigned char in standard.hpp works
> perfectly well for me!
> As an example:
> input: 'j' 'o' 'e' 'l' ':' followed by the bytes 226 129 131
> grammar => Root %= qi::string("joel:") >> (+alnum | +char_((unsigned
> char)128, (unsigned char)255));
> This will accept the input and preserve the byte ordering. So, be it
> UTF-8 or not, the goal for me is to match the grammar correctly and
> preserve the byte ordering.
> When I evaluate the string returned by Root in my parse-tree node, the
> above example gives me: "joel:" + (unsigned char)226 + (unsigned
> char)129 + (unsigned char)131.
> See, I am not trying to suggest that this is how you process a grammar
> in UTF-8; I am suggesting a way to let a UTF-8 stream pass through Qi
> so that a grammar written in ASCII still works fine!

Good, if it works for you. It seems fragile to me to rely on your
platform's behavior, though. Personally, I'd rather stick to ASCII and
just return false on anything non-ASCII.

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net
http://www.facebook.com/djowel

Meet me at BoostCon
http://www.boostcon.com/home
http://www.facebook.com/boostcon





Re: [Spirit.Qi] How do I use UTF-8 encoding with Qi

Joel de Guzman-2
In reply to this post by Henry Tan-2
On 1/29/2010 8:10 AM, Henry Tan wrote:

> BTW, I thought the cleaner way to add my custom char_encoding is to add
> a new file instead of changing an existing one. If, let's say, I want
> to add a new category, say uchar8.hpp, what is the list of changes I
> need to make? For example, the files containing the standard_wide
> keyword are the following. Is it as easy as just following the trail of
> the standard_wide occurrences and making similar declarations, etc.?

Yep :-) It's as easy as that. I'm doing the same for unicode as we speak.
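
For orientation, a bare-bones skeleton of what such an encoding header
could look like (entirely illustrative: uchar8 and everything in it are
hypothetical names, the full classifier set should mirror the real
standard_wide.hpp, and the hookups follow the file trail listed above):

    // Hypothetical boost/spirit/home/support/char_encoding/uchar8.hpp,
    // mirroring the shape of standard_wide.hpp.
    #if !defined(BOOST_SPIRIT_UCHAR8_INCLUDED)
    #define BOOST_SPIRIT_UCHAR8_INCLUDED

    namespace boost { namespace spirit { namespace char_encoding
    {
        struct uchar8
        {
            typedef unsigned char char_type;

            static bool ischar(int ch)
            {
                return 0 <= ch && ch <= 255;   // accept any octet
            }

            static bool isalnum(int ch)
            {
                // classify only the ASCII subset; reject everything else
                return ('0' <= ch && ch <= '9')
                    || ('A' <= ch && ch <= 'Z')
                    || ('a' <= ch && ch <= 'z');
            }

            // ... isalpha, isdigit, tolower, toupper, etc. follow the
            //     same pattern as their standard_wide.hpp counterparts
        };
    }}}

    #endif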

Regards,
--
Joel de Guzman
http://www.boostpro.com
http://spirit.sf.net
http://www.facebook.com/djowel

Meet me at BoostCon
http://www.boostcon.com/home
http://www.facebook.com/boostcon



