Interest in Unicode library for Boost?


Boost - Dev mailing list
I've been working on a Unicode library for submission to Boost, with an eye
toward standardizing robust Unicode support for C++.

It started as a better string library for namespace "std2", with minimal
Unicode support.  Though "std2" may never happen, those string types are
still in there, and the library has grown to also include all the Unicode
features most users will ever need.

You can find the Github page here: https://github.com/tzlaine/text

You can find the online docs here: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the
embarrassment of being the only major production language with next to no
Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May, though it's a bit out of date,
as the library was not then finished.  It's three hours, so, y'know, maybe
skip it.  For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH
https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On 9/23/18 7:45 AM, Zach Laine via Boost wrote:

> I've been working on a Unicode library for submission to Boost, with an eye
> toward standardizing robust Unicode support for C++.
>
> It started as a better string library for namespace "std2", with minimal
> Unicode support.  Though "std2" may never happen, those string types are
> still in there, and the library has grown to also include all the Unicode
> features most users will ever need.
>
> You can find the Github page here: https://github.com/tzlaine/text
>
> You can find the online docs here: https://tzlaine.github.io/text
>
> If you care about portable Unicode support, or even addressing the
> embarrassment of being the only major production language with next to no
> Unicode support, please have a look and provide feedback.
>
> I gave a talk about this at C++Now in May, though it's a bit out of date,
> as the library was not then finished.  It's three hours, so, y'know, maybe
> skip it.  For completeness' sake:
>
> https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH
> https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

I think a Unicode library is very much needed in Boost.

Out of curiosity, it looks like you implemented Unicode algorithms
yourself. Why not use a specialized library, like ICU?


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 2:57 AM Andrey Semashev via Boost
<[hidden email]> wrote:
> Why not use a specialized library, like ICU?

The moment I see that a potential library or application uses ICU I
give it a hard pass.

Regards


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost <
[hidden email]> wrote:
>
> On 9/23/18 7:45 AM, Zach Laine via Boost wrote:
>
> I think a Unicode library is very much needed in Boost.
>
> Out of curiosity, it looks like you implemented Unicode algorithms
> yourself. Why not use a specialized library, like ICU?

It's partly a question of the size of ICU, which is several megabytes,
whereas Boost.Text is only 1.2-2MB depending on your compiler.

I built HEAD of ICU just now, and here are the resulting .so's:

-rwxrwxr-x 1 tzlaine tzlaine  26M Sep 23 10:29 ./lib/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 3.6M Sep 23 10:28 ./lib/libicui18n.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine  65K Sep 23 10:28 ./lib/libicuio.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine  66K Sep 23 10:28 ./lib/libiculx.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 234K Sep 23 10:28 ./lib/libicutu.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 2.2M Sep 23 10:28 ./lib/libicuuc.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine 5.3K Sep 23 10:28 ./stubdata/libicudata.so.62.1
-rwxrwxr-x 1 tzlaine tzlaine  83K Sep 23 10:28 ./tools/ctestfw/libicutest.so.62.1

So, I don't know how many of those you need, but if you require data (and
you do!), 26MB is a lot.  Note that I put collation data into headers, so
your runtime memory footprint might be much larger than 1.2-2MB, but the
minimum requirement is still only that small.  Requiring the user to pay
more than this minimum is a classic "Don't pay for what you don't use"
violation.

Another thing is that ICU allocates memory all over the place, in some
cases needlessly.

ICU also has IMO a poor (too complicated and confusing) API; there are way
too many types and functions, and the types that are emphasized are often
the wrong ones, like UTF-16 strings.  The algorithms should be C++-style
algorithms if this is something we're going to standardize.

Zach


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
Zach Laine wrote:

> You can find the online docs here: https://tzlaine.github.io/text

I find the "string" layer a hard sell. First, realistically, nobody is going
to use it over std::string, especially when its selling point is "we make
your code not compile by removing functions from std::string". Second, some
of the removed functions are part of the Sequence requirements. Hard to see
the benefits of that removal; string<Ch> and vector<Ch> being compatible on
a concept level is useful.

This of course in no way diminishes the utility of the library. If its
opinionated `string` is part of the price of admission, so be it. I'm just
saying. :-)



Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 10:40 AM Peter Dimov via Boost <
[hidden email]> wrote:

> Zach Laine wrote:
>
> > You can find the online docs here: https://tzlaine.github.io/text
>
> I find the "string" layer a hard sell. First, realistically, nobody is going
> to use it over std::string, especially when its selling point is "we make
> your code not compile by removing functions from std::string". Second, some
> of the removed functions are part of the Sequence requirements. Hard to see
> the benefits of that removal; string<Ch> and vector<Ch> being compatible on
> a concept level is useful.
>

string is not and probably never will be a SequenceContainer, but I take
your point about text::string being a breaking change.  The original
impetus for the whole library was a rethink of 'std::string' for a possible
'std2::string'.  'std2' is probably DOA, given LEWG's over-my-dead-body
reaction to the idea.  So, the string layer stuff is still there, but its
usefulness is now probably restricted to its interoperation with
unencoded_rope.


> This of course in no way diminishes the utility of the library. If its
> opinionated `string` is part of the price of admission, so be it. I'm just
> saying. :-)
>

Other string types, including std::string, are interoperable with most of
Boost.Text, via concept-accepting overloads in the string- and text-layer
types.  Also, you can get away with never using text::string at all if you
want, for instance if you only use the text-layer types and/or the Unicode
layer.
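
For illustration only, here is a minimal C++17 sketch of what a
"concept-accepting overload" could look like. The names
sketch::is_char_range and sketch::count_code_points are hypothetical and are
not taken from Boost.Text; the only assumption is that a "char range" is
anything exposing data() and size() over char.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, not Boost.Text

// Detects a "contiguous char range": any type whose const object exposes
// data() convertible to char const * and size() convertible to std::size_t.
template<typename T, typename = void>
struct is_char_range : std::false_type {};

template<typename T>
struct is_char_range<
    T,
    std::void_t<
        decltype(std::declval<T const &>().data()),
        decltype(std::declval<T const &>().size())>>
    : std::bool_constant<
          std::is_convertible_v<
              decltype(std::declval<T const &>().data()), char const *> &&
          std::is_convertible_v<
              decltype(std::declval<T const &>().size()), std::size_t>>
{};

// One overload that accepts std::string, std::string_view,
// std::vector<char>, or any user-defined type satisfying is_char_range.
// It counts every byte that is not a UTF-8 continuation byte (10xxxxxx),
// which equals the number of code points for well-formed UTF-8 input.
template<
    typename CharRange,
    typename = std::enable_if_t<is_char_range<CharRange>::value>>
std::size_t count_code_points(CharRange const & r)
{
    std::size_t n = 0;
    char const * p = r.data();
    for (std::size_t i = 0; i < r.size(); ++i) {
        if ((static_cast<unsigned char>(p[i]) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

}  // namespace sketch
```

With this in place, sketch::count_code_points(std::string("abc")) and
sketch::count_code_points(std::vector<char>{'a', 'b'}) both compile, while a
std::vector<int> argument is rejected at overload resolution.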

Zach


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
Zach Laine wrote:

> So, the string layer stuff is still there, but its usefulness is now
> probably restricted to its interoperation with unencoded_rope.

Also, `string` stealing the buffer of `string_builder`, which can't be
implemented in terms of std::string.

Either way, I still think that removing push_back is taking things a bit too
far.



Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On 9/23/18 11:37 AM, Zach Laine via Boost wrote:

> On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost <
> [hidden email]> wrote:
>> On 9/23/18 7:45 AM, Zach Laine via Boost wrote:
>>
>> I think a Unicode library is very much needed in Boost.
>>
>> Out of curiosity, it looks like you implemented Unicode algorithms
>> yourself. Why not use a specialized library, like ICU?
> It's partly a question of the size of ICU, which is several megabytes,
> whereas Boost.Text is only 1.2-2MB depending on your compiler.


Ideally, a "Unicode library for Boost" would offer an API, and the
question of what backend is used would be an implementation detail.
While I'm very enthusiastic about proper Unicode support being added to
C++, I have a hard time with the tendency in the Boost community to
reinvent wheels, i.e. the NIH syndrome. A good API / library design
should allow me to plug in existing implementations (for standard
functionality that's already implemented many times before), as a matter
of code reuse and maintainability.

Best,


Stefan

--

      ...ich hab' noch einen Koffer in Berlin...
   



Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On 9/22/18 9:45 PM, Zach Laine via Boost wrote:
> I've been working on a Unicode library for submission to Boost, with an eye
> toward standardizing robust Unicode support for C++.

Hmmm, isn't there a lot of overlap with Boost.Locale?

https://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/charset_handling.html

Also, in boost detail there's a UTF facet which has been in use for
many, many years.  What would be the relationship with that?

Robert Ramey


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
> Either way, I still think that removing push_back is taking things a bit
> too far.

What is the difference between `text::string_view` and `std::string_view`?
(We already have `boost::string_view` in Utility too.)



Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On 23.09.2018 at 17:55, Stefan Seefeld via Boost wrote:

> On 9/23/18 11:37 AM, Zach Laine via Boost wrote:
>> On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost <
>> [hidden email]> wrote:
>>> On 9/23/18 7:45 AM, Zach Laine via Boost wrote:
>>>
>>> I think a Unicode library is very much needed in Boost.
>>> Why not use a specialized library, like ICU?
>> It's partly a question of the size of ICU, which is several megabytes,
>> whereas Boost.Text is only 1.2-2MB depending on your compiler.
>
> Ideally, a "Unicode library for Boost" would offer an API, and the
> question of what backend is used would be an implementation detail.

Right. For example, Windows 10 comes with ICU built in, ready for
consumption:
https://docs.microsoft.com/en-us/windows/desktop/intl/international-components-for-unicode--icu-

Ciao
  Dani


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On 9/23/18 6:37 PM, Zach Laine via Boost wrote:

> On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost <
> [hidden email]> wrote:
>>
>> On 9/23/18 7:45 AM, Zach Laine via Boost wrote:
>>
>> I think a Unicode library is very much needed in Boost.
>>
>> Out of curiosity, it looks like you implemented Unicode algorithms
>> yourself. Why not use a specialized library, like ICU?
>
> It's partly a question of the size of ICU, which is several megabytes,
> whereas Boost.Text is only 1.2-2MB depending on your compiler.
>
> I built HEAD of ICU just now, and here are the resulting .so's:
>
> -rwxrwxr-x 1 tzlaine tzlaine  26M Sep 23 10:29 ./lib/libicudata.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine 3.6M Sep 23 10:28 ./lib/libicui18n.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine  65K Sep 23 10:28 ./lib/libicuio.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine  66K Sep 23 10:28 ./lib/libiculx.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine 234K Sep 23 10:28 ./lib/libicutu.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine 2.2M Sep 23 10:28 ./lib/libicuuc.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine 5.3K Sep 23 10:28 ./stubdata/libicudata.so.62.1
> -rwxrwxr-x 1 tzlaine tzlaine  83K Sep 23 10:28 ./tools/ctestfw/libicutest.so.62.1
>
> So, I don't know how many of those you need, but if you require data (and
> you do!), 26MB is a lot.  Note that I put collation data into headers, so
> your runtime memory footprint might be much larger than 1.2-2MB, but the
> minimum requirement is still only that small.  Requiring the user to pay
> more than this minimum is a classic "Don't pay for what you don't use"
> violation.

Runtime memory footprint is actually more important. If I have 10
processes running on the machine that use ICU then I'm only paying its
price once while in your case I would be paying it 10 times. Given that
ICU is rather well adopted, this is not an unrealistic benefit. So, if
not using ICU you may want to consider if at least some of the runtime
data can be put in constant sections of a shared library.

> ICU also has IMO a poor (too complicated and confusing) API; there are way
> too many types and functions, and the types that are emphasized are often
> the wrong ones, like UTF-16 strings.  The algorithms should be C++-style
> algorithms if this is something we're going to standardize.

Its API could be wrapped inside your library so that users never have to
interface with it directly.

Nevertheless, thanks for the answer, and I still think a Unicode library
like yours is very much needed.


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 10:55 AM Peter Dimov via Boost <
[hidden email]> wrote:

> Zach Laine wrote:
>
> > So, the string layer stuff is still there, but its usefulness is now
> > probably restricted to its interoperation with unencoded_rope.
>
> Also, `string` stealing the buffer of `string_builder`, which can't be
> implemented in terms of std::string.
>

True, but I don't know how to implement a string_builder that interoperates
with std::string (without standardizing it, of course).


> Either way, I still think that removing push_back is taking things a bit
> too far.
>

Fair enough.  As crazy as it sounds, I had to add resize() at some point
too.  I may have gone too far, as you say. :)

However, it would still be my preference that SequenceContainer support
front(), back(), and push_back() as algorithms, not members.  I see no
value in dragging that member API around with us for every new
SequenceContainer we introduce -- at least not for new code.  Having those
functions as algorithms still allows them to be used generically.
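
As a rough sketch of the idea (not code from Boost.Text; the namespace and
function names below are hypothetical), front(), back(), and push_back() can
be written once as free function templates over any container that provides
begin(), end(), and a positional insert():

```cpp
#include <iterator>
#include <string>
#include <utility>
#include <vector>

namespace sketch {  // hypothetical namespace, not Boost.Text

// First element. Precondition: the container is not empty.
template<typename Container>
auto front(Container & c) -> decltype(*std::begin(c))
{
    return *std::begin(c);
}

// Last element. Precondition: the container is not empty and its
// iterators are at least bidirectional.
template<typename Container>
auto back(Container & c) -> decltype(*std::prev(std::end(c)))
{
    return *std::prev(std::end(c));
}

// Append one element, expressed through the container's insert() member,
// so the container itself does not need a push_back() member.
template<typename Container, typename T>
void push_back(Container & c, T && value)
{
    c.insert(std::end(c), std::forward<T>(value));
}

}  // namespace sketch
```

Generic code can then call sketch::push_back(c, x) and sketch::back(c) for a
std::string, a std::vector, or a new container type that offers only
iterators and insert().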

Zach


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 10:55 AM Stefan Seefeld via Boost <
[hidden email]> wrote:

> On 9/23/18 11:37 AM, Zach Laine via Boost wrote:
> > On Sun, Sep 23, 2018 at 4:57 AM Andrey Semashev via Boost <
> > [hidden email]> wrote:
> >> On 9/23/18 7:45 AM, Zach Laine via Boost wrote:
> >>
> >> I think a Unicode library is very much needed in Boost.
> >>
> >> Out of curiosity, it looks like you implemented Unicode algorithms
> >> yourself. Why not use a specialized library, like ICU?
> > It's partly a question of the size of ICU, which is several megabytes,
> > whereas Boost.Text is only 1.2-2MB depending on your compiler.
>
>
> Ideally, a "Unicode library for Boost" would offer an API, and the
> question of what backend is used would be an implementation detail.
> While I'm very enthusiastic about proper Unicode support being added to
> C++, I have a hard time with the tendency in the Boost community to
> reinvent wheels, i.e. the NIH syndrome. A good API / library design
> should allow me to plug in existing implementations (for standard
> functionality that's already implemented many times before), as a matter
> of code reuse and maintainability.
>

I agree with this in the abstract.  In this case, I don't know of any back
end that would work except for ICU.  As for having been implemented many
times, I'm not aware of any other implementations of all the named Unicode
algorithms besides ICU.  My hope is that my implementation is more
palatable to most users than the ICU one.  It certainly is for me.

Zach


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 11:11 AM Peter Dimov via Boost <
[hidden email]> wrote:

> > Either way, I still think that removing push_back is taking things a bit
> > too far.
>
> What is the difference between `text::string_view` and `std::string_view`?
> (We already have `boost::string_view` in Utility too.)
>

At the highest level of abstraction, there is no distinction.  As you look
at implementation details, though, you see that text::string_view is
stripped down in the same way that text::string is, and that it
interoperates with the text-layer types gracefully.

Zach


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On Sun, Sep 23, 2018 at 11:04 AM Robert Ramey via Boost <
[hidden email]> wrote:

> On 9/22/18 9:45 PM, Zach Laine via Boost wrote:
> > I've been working on a Unicode library for submission to Boost, with an eye
> > toward standardizing robust Unicode support for C++.
>
> Hmmm isn't there a lot of overlap with Boost.Locale:
>
>
> https://www.boost.org/doc/libs/1_48_0/libs/locale/doc/html/charset_handling.html


Not that I can tell.  They both operate using UTF encodings, but as I
understand it Boost.Locale concerns itself heavily (exclusively?) with
iostreams, whereas Boost.Text has no relation to iostreams except for a few
stream inserters.

> Also, in boost detail there's a UTF facet which has been in use for
> many, many years.  What would be the relationship with that?
>

None at all.

I should note that there are two or three different UTF-8 <-> UTF-32
standalone transcoding implementations in Boost, but as far as I can tell
(I only tried with two of them) none of them produces encoding errors in
the manner (replacement character, not exception) and locations within the
code units stream recommended by Unicode.
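
To make the "replacement character, not exception" behavior concrete, here is
a minimal sketch of a UTF-8 to UTF-32 decoder. It is not Boost.Text's
implementation, and the name sketch::utf8_to_utf32 is hypothetical. It
assumes the "substitution of maximal subparts" practice described in the
Unicode Standard: each ill-formed subsequence becomes a single U+FFFD, and
the byte that broke the sequence is re-examined as the start of a new one.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

namespace sketch {  // hypothetical namespace, not Boost.Text

constexpr char32_t replacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

inline std::u32string utf8_to_utf32(std::string_view in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char const b0 = static_cast<unsigned char>(in[i]);
        ++i;
        if (b0 < 0x80) {  // ASCII
            out.push_back(b0);
            continue;
        }

        int need = 0;                         // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range of next byte
        char32_t cp = 0;
        if (0xC2 <= b0 && b0 <= 0xDF) {
            need = 1; cp = b0 & 0x1F;
        } else if (0xE0 <= b0 && b0 <= 0xEF) {
            need = 2; cp = b0 & 0x0F;
            if (b0 == 0xE0) lo = 0xA0;        // excludes overlong forms
            if (b0 == 0xED) hi = 0x9F;        // excludes surrogates
        } else if (0xF0 <= b0 && b0 <= 0xF4) {
            need = 3; cp = b0 & 0x07;
            if (b0 == 0xF0) lo = 0x90;        // excludes overlong forms
            if (b0 == 0xF4) hi = 0x8F;        // excludes values > U+10FFFF
        } else {
            // Stray continuation byte, or a byte that never starts a
            // well-formed sequence (C0, C1, F5..FF).
            out.push_back(replacement);
            continue;
        }

        bool ok = true;
        for (int k = 0; k < need; ++k) {
            if (i == in.size()) { ok = false; break; }   // truncated input
            unsigned char const b = static_cast<unsigned char>(in[i]);
            if (b < lo || hi < b) { ok = false; break; } // do not consume b
            cp = (cp << 6) | (b & 0x3F);
            ++i;
            lo = 0x80; hi = 0xBF;  // only the first continuation is special
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}

}  // namespace sketch
```

For the byte sequence 61 F1 80 80 E1 80 C2 62, this yields U+0061, three
U+FFFD (one each for the truncated F1 80 80, E1 80, and C2), then U+0062. No
exception is thrown and no well-formed input is skipped.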

Zach


Re: Interest in Unicode library for Boost?

Boost - Dev mailing list
On 23 September 2018 at 16:37, Zach Laine via Boost <[hidden email]>
wrote:

>
> It's partly a question of the size of ICU, which is several megabytes,
> whereas Boost.Text is only 1.2-2MB depending on your compiler.
>

The Unicode library I did as a SoC project in 2009 was significantly
smaller than that and if I recall correctly it has more data than the one
in your library.
Clearly some work can be done here to better optimize the database size.
