[Spirit] Looking for a little Qi guidance for Unicode parsing

Boost - Users mailing list
Hello,

I am turning a corner in my JSON parser. It handles ASCII through and
through, but now I want to support Unicode, apparently as UTF-8, which
is part of the JSON standard. From what I can tell, this does not
affect the entire grammar, just Strings.

I am looking for a little guidance on how to approach that issue, the
elements involved, etc. For example, are we talking about C++
std::wstring? I have also seen std::u32string referenced in some
forums.

To begin with, and this may be a somewhat naive impression: would the
characters no longer translate to unsigned char or char, but rather to
std::wstring::value_type or std::u32string::value_type? Things like
that come to mind when approaching the issue.

Additionally, how should I handle symbol tables such as the one for
escape characters, e.g. starting from:

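#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

// Maps a two-character escape sequence in the input (backslash plus one
// character) to the single character it denotes.
struct escapes_t : qi::symbols<char, char> {
    escapes_t() {
        this->add("\\b", '\b')
            ("\\f", '\f')
            ("\\n", '\n')
            ("\\r", '\r')
            ("\\t", '\t')
            ("\\v", '\v')    // note: \v is not among the JSON string escapes
            ("\\\\", '\\')
            ("\\/", '/')
            ("\\'", '\'')    // note: \' is not among the JSON string escapes
            ("\\\"", '"')
            ;
    }
} char_esc;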

And on from there.

Thanks!

Best regards,

Michael W Powell
_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users
Re: [Spirit] Looking for a little Qi guidance for Unicode parsing

On Sun, Jan 27, 2019 at 11:05 Michael Powell via Boost-users <[hidden email]> wrote:
[snip]

The answer to your question is a bit more complicated than you might expect. In short, std::string is capable of representing Unicode text, because there is a difference between the binary representation (bits and bytes) and the meaning (code points): a std::string can simply hold the UTF-8 encoded bytes, with one code point occupying between one and four char elements. It would probably be illuminating for you to watch a talk called “Unicode in C++” by James McNellis (https://m.youtube.com/watch?v=tOHnXt3Ycfo).
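
To make the bytes versus code points distinction concrete, here is a minimal sketch using only the standard library. The names count_code_points and append_utf8 are hypothetical helpers made up for this illustration, not part of Boost or Spirit, and both assume well-formed input (no validation, no surrogate handling).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Count code points in a UTF-8 encoded std::string by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Append one code point to a std::string as 1 to 4 UTF-8 bytes.
// Hypothetical helper, e.g. for a value decoded from a JSON \uXXXX escape.
// Assumption: cp is a valid Unicode scalar value (no surrogates, <= 0x10FFFF).
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

With something along these lines, a parser can keep char as its character type and store whatever the string rule produces in a plain std::string; size() then reports bytes, while count_code_points reports how many characters a reader would see.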
--
Travis Göckel
+1.720.234.9330
