Qi + UTF-32 (Unicode) question regarding performance (small example attached)


Qi + UTF-32 (Unicode) question regarding performance (small example attached)

Mathias Born
Hi,

I'd appreciate advice on how to achieve best performance. I need to parse
text encoded as UTF-32.
A minimal example is attached and looks like this:

--start--

#define BOOST_SPIRIT_UNICODE

#include <iostream>
#include <string>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>

using namespace std::string_literals;
using namespace boost;
using namespace boost::spirit;


// This is needed to make qi::parse work with std::u32string::const_iterator.
namespace boost {
        template<> struct is_scalar<std::u32string::const_iterator> : public true_type {};
}

// Enable auto conversion from std::u32string (UTF-32) to std::string (UTF-8).
namespace boost {
        namespace spirit {
                namespace traits {
                        template <> // <typename Attrib, typename T, typename Enable>
                        struct assign_to_container_from_value<std::string, std::u32string, void>
                        {
                                typedef u32_to_u8_iterator<std::u32string::const_iterator> Conv;

                                static void call(std::u32string const& val, std::string& attr) {
                                        attr.assign(Conv(val.begin()), Conv(val.end()));
                                }
                        };
                }
        }
}

int main()
{
        qi::rule<std::u32string::const_iterator, std::u32string()> test1 = "ASCII" > +unicode::char_;
        qi::symbols<char32_t> syms;
        syms.add(U"sym");
        qi::rule<std::u32string::const_iterator, std::u32string()> test2 = L"WIDE" > syms > +unicode::char_;

        // The following line doesn't compile. Error msg:
        // c:\cpp\boost_1_63_0\boost\spirit\home\qi\nonterminal\rule.hpp(177) : error C2338 : error_invalid_expression
        // qi::rule<std::u32string::const_iterator, std::u32string()> test3 = U"U32" > +unicode::char_;

        auto input1 = U"ASCIIfoo"s;
        auto input2 = U"WIDEsymbar"s;

        std::string attr1, attr2;

        auto result1 = qi::parse(input1.cbegin(), input1.cend(), test1, attr1);
        std::cout << result1 << " " << attr1 << std::endl;

        auto result2 = qi::parse(input2.cbegin(), input2.cend(), test2, attr2);
        std::cout << result2 << " " << attr2 << std::endl;

        return 0;
}

--end--

Output is:
---
1 foo
1 bar
---

As you can see, I have already figured out how to make Qi accept a
std::u32string as input, and all parsers work as expected.
However, I wonder what happens behind the scenes. Parsers "test1" and
"test2" use literals that are not UTF-32
(at least on Windows, where wchar_t is 16 bits wide).
But the input is UTF-32, so isn't a conversion to UTF-32 necessary at
runtime? If so, I'd like to use UTF-32 literals
to avoid any conversion, but that doesn't compile (see "test3").

Yet Spirit is apparently already capable of processing UTF-32 literals, as
"syms" demonstrates.
It looks to me like I may just be missing some template specialization
that enables it.

Best Regards,
Mathias



------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Spirit-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/spirit-general

simple.cpp (1K) Download Attachment

Re: Qi + UTF-32 (Unicode) question regarding performance (small example attached)

Mathias Born
Hi,

I found the answer by stepping through the code in a debugger,
so I'll answer my own question.

In the end, all parsing is done by comparing characters.
For example, the literal parser (in home/qi/detail/string_parse.hpp)
does:

        for (; !!ch; ++i)
        {
            if (i == last || (ch != *i))
                return false;
            ch = *++str;
        }

where (in the example I posted) the type of ch is "char" and the type
of *i is "char32_t". The comparison just converts each character to a
common integer type, so the literal is never transcoded to UTF-32 as a
whole and there should be no performance problem at all.
The same applies to the numeric parsers.
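
To make the point above concrete, here is a minimal standalone sketch of the same idea without Boost. The names match_literal and parse_uint are hypothetical helpers written for this illustration, not Spirit's API. The sketch assumes that literals contain only ASCII characters, whose values are identical in char, wchar_t and char32_t. A non-ASCII character in a narrow literal is stored as UTF-8 bytes and will not compare equal to the corresponding UTF-32 code point.

```cpp
#include <string>

// Hypothetical helper (not part of Spirit): match a NUL-terminated literal of
// any character type against [first, last). On success, 'first' is advanced
// past the matched text; on failure it is left unchanged.
template <typename Char, typename Iterator>
bool match_literal(Char const* str, Iterator& first, Iterator const& last)
{
    Iterator i = first;
    for (Char ch = *str; ch != Char(0); ch = *++str, ++i)
    {
        // Each character is widened individually to a 32-bit value and
        // compared; the literal is never converted as a whole string.
        if (i == last || static_cast<char32_t>(ch) != static_cast<char32_t>(*i))
            return false;
    }
    first = i;
    return true;
}

// Hypothetical helper: parse an unsigned decimal number directly from
// UTF-32 input. Overflow is deliberately not handled in this sketch.
template <typename Iterator>
bool parse_uint(Iterator& first, Iterator const& last, unsigned& out)
{
    Iterator i = first;
    unsigned value = 0;
    for (; i != last && *i >= U'0' && *i <= U'9'; ++i)
        value = value * 10 + static_cast<unsigned>(*i - U'0');
    if (i == first)
        return false;   // no digits found
    first = i;
    out = value;
    return true;
}
```

With these helpers, match_literal("ASCII", it, end), match_literal(L"WIDE", it, end) and match_literal(U"U32", it, end) all work on a std::u32string::const_iterator, because the loop itself does not care about the character type of the literal.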

In order to use 32-bit literals, one would probably have to write
corresponding trait-specializations following the contents of
home/support/string_traits.hpp, but I didn't try that.

Hope this helps someone,
Best Regards,
Mathias

> -----Original Message-----
> From: Mathias Born [mailto:[hidden email]]
> Sent: Monday, 23 January 2017 22:06
> To: [hidden email]
> Subject: [Spirit-general] Qi + UTF-32 (Unicode) question regarding
> performance (small example attached)
>
> Hi,
>
> I'd appreciate advice on how to achieve best performance. I need to parse
> text encoded as UTF-32.
> A minimal example is attached and looks like this:
> ...
> However, I wonder what happens behind the scenes. Parsers "test1" and
> "test2" use literals which are not UTF-32.
> (At least on Windows, where wchar_t is 16 bit.)
> But the input is, so isn't there a conversion to UTF-32 necessary at
> runtime? If so, I'd like to use UTF-32 literals
> in order to avoid any conversion, but that doesn't compile (see "test3").


