Interest in a Unicode library for Boost?

Boost - Users mailing list
About 14 months ago I posted the same thing.  There was significant work that needed to be done to Boost.Text (the proposed library), and I was a bit burned out.

Now I've managed to make the necessary changes, and I feel the library is ready for review, if there is interest.

This library, in part, is something I want to standardize.

It started as a better string library for namespace "std2", with minimal Unicode support.  Though "std2" will almost certainly never happen now, those string types are still in there, and the library has grown to also include all the Unicode features most users will ever need.

Github: https://github.com/tzlaine/text
Online docs: https://tzlaine.github.io/text

If you care about portable Unicode support, or even addressing the embarrassment of being the only major production language with next to no Unicode support, please have a look and provide feedback.

I gave a talk about this at C++Now in May 2018, and now it's a bit out of date, as the library was not then finished.  It's three hours, so, y'know, maybe skip it.  For completeness' sake:

https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH
https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8

Zach


_______________________________________________
Boost-users mailing list
[hidden email]
https://lists.boost.org/mailman/listinfo.cgi/boost-users

Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 10/25/19 6:11 PM, Zach Laine via Boost-users wrote:

> About 14 months ago I posted the same thing.  There was significant work
> that needed to be done to Boost.Text (the proposed library), and I was a
> bit burned out.
>
> Now I've managed to make the necessary changes, and I feel the library
> is ready for review, if there is interest.
>
> This library, in part, is something I want to standardize.
>
> It started as a better string library for namespace "std2", with minimal
> Unicode support.  Though "std2" will almost certainly never happen now,
> those string types are still in there, and the library has grown to also
> include all the Unicode features most users will ever need.
>
> Github: https://github.com/tzlaine/text
> Online docs: https://tzlaine.github.io/text
>
> If you care about portable Unicode support, or even addressing the
> embarrassment of being the only major production language with next to
> no Unicode support, please have a look and provide feedback.
>
> I gave a talk about this at C++Now in May 2018, and now it's a bit out
> of date, as the library was not then finished.  It's three hours, so,
> y'know, maybe skip it.  For completeness' sake:
>
> https://www.youtube.com/watch?v=944GjKxwMBo&index=7&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH
> https://www.youtube.com/watch?v=GJ2xMAqCZL8&list=PL_AKIMJc4roVSbTTfHReQTl1dc9ms0lWH&index=8
>
> Zach

How is this related to Boost.Locale ? Conflict/Complement or ???

Robert Ramey




Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 26.10.19 03:11, Zach Laine via Boost-users wrote:
> If you care about portable Unicode support, or even addressing the
> embarrassment of being the only major production language with next to no
> Unicode support, please have a look and provide feedback.

I can't see myself using the string layer at all.  My codebase is too
deeply linked to std::string, as is the standard library, and a fair
number of third-party libraries I am using.  Also, the primary advantage
of the string layer seems to be a narrower interface, which is not an
advantage at all to me as a user.  std::string::find may be bad design,
but it doesn't hurt me, it just makes finding elements in a string
slightly more convenient.

I am very much interested in the unicode layer.  I am currently using
ICU, and I'd really like to remove this dependency.  ICU is big, it's
difficult to build, and I'm stuck on an older version because of
compatibility issues.

As for the text layer, the fact that it uses FCC means that I probably
won't use it because I have standardized on NFD.


--
Rainer Deyke ([hidden email])


Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
It is unrelated.

Zach

On Fri, Oct 25, 2019, 11:08 PM Robert Ramey via Boost-users <[hidden email]> wrote:
> How is this related to Boost.Locale ? Conflict/Complement or ???
>
> Robert Ramey




Re: Interest in a Unicode library for Boost?

Boost - Users mailing list

On Sat, Oct 26, 2019, 12:41 AM Rainer Deyke via Boost-users <[hidden email]> wrote:
> On 26.10.19 03:11, Zach Laine via Boost-users wrote:
>> If you care about portable Unicode support, or even addressing the
>> embarrassment of being the only major production language with next to no
>> Unicode support, please have a look and provide feedback.
>
> I can't see myself using the string layer at all.  My codebase is too
> deeply linked to std::string, as is the standard library, and a fair
> number of third-party libraries I am using.  Also, the primary advantage
> of the string layer seems to be a narrower interface, which is not an
> advantage at all to me as a user.

It is also a place to experiment with things like ropes and string builders.  I would like to standardize both, and I need a string that actually interoperates with those to show how they might work.

> std::string::find may be bad design,
> but it doesn't hurt me, it just makes finding elements in a string
> slightly more convenient.

But it does hurt newcomers to the language, who must learn a slightly different API for string and string_view, and static_string and fixed_string if we get those.  It also hurts the standardization effort to review all those APIs.  You cannot use the std::string search algorithms on spans and other ranges or views either.

Returning -1 instead of the end index is also pretty horrible.

If convenience is so paramount, why don't we add member sort () to vector?  This is not a troll, I would really like to know.  I want to find something in a vector or sort a vector about as often as I want to find a character or subsequence within a string.  What, to you, is the difference?  If there isn't one, please explain that too.
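
A minimal sketch of the point about generic algorithms, assuming nothing beyond the standard library. The helper name contains_char is hypothetical and is not part of Boost.Text or std; it only shows that one std::find call covers strings, views, and other ranges, and that failure is reported as the end of the half-open range.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not from any library): report whether `c` occurs
// in any range of char, using only the generic std::find.
template <typename Range>
bool contains_char(Range const & r, char c)
{
    auto const last = std::end(r);
    // On failure std::find returns the end of the half-open range.
    return std::find(std::begin(r), last, c) != last;
}

// The same helper works unchanged for three different range types.
inline void demo()
{
    std::string const s = "a.b";
    std::string_view const sv = s;
    std::vector<char> const v(s.begin(), s.end());

    assert(contains_char(s, '.'));
    assert(contains_char(sv, '.'));
    assert(contains_char(v, '.'));
    assert(!contains_char(v, '?'));

    // The member function exists only on the string types and signals
    // failure with npos rather than with an end position.
    assert(s.find('?') == std::string::npos);
}
```

Calling demo() exercises all three range types; std::vector<char> has no member find at all, which is the asymmetry being described.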

> I am very much interested in the unicode layer.  I am currently using
> ICU, and I'd really like to remove this dependency.  ICU is big, it's
> difficult to build, and I'm stuck on an older version because of
> compatibility issues.
>
> As for the text layer, the fact that it uses FCC means that I probably
> won't use it because I have standardized on NFD.

Completely understandable.

NFC, very close to FCC, is more popular, due to its compactness.  I picked the normalization form with the most readily available time and space optimizations, and then stuck to just that one -- the alternative is many text types with different normalizations having to interoperate, which sounds like hell.

Zach



Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 26.10.19 18:41, Zach Laine via Boost-users wrote:

> On Sat, Oct 26, 2019, 12:41 AM Rainer Deyke via Boost-users <
> [hidden email]> wrote:
>
>> On 26.10.19 03:11, Zach Laine via Boost-users wrote:
>>> If you care about portable Unicode support, or even addressing the
>>> embarrassment of being the only major production language with next to no
>>> Unicode support, please have a look and provide feedback.
>>
>> I can't see myself using the string layer at all.  My codebase is too
>> deeply linked to std::string, as is the standard library, and a fair
>> number of third-party libraries I am using.  Also, the primary advantage
>> of the string layer seems to be a narrower interface, which is not an
>> advantage at all to me as a user.
>
>
> It is also a place to experiment with things like ropes and string
> builders.  I would like to standardize both, and I need a string that
> actually interoperates with those to show how they might work.
>
> std::string::find may be bad design,
>> but it doesn't hurt me, it just makes finding elements in a string
>> slightly more convenient.
>
> But it does hurt newcomers to the language, who must learn a slightly
> different API for string and string_view, and static_string and
> fixed_string if we get those.  It also hurts the standardization effort to
> review all those APIs.  You cannot use the std::string search algorithms on
> spans and other ranges or views either.

The issue isn't if your string is better than std::string.  The issue is
if your string provides enough of an improvement to justify switching from
std::string, after the time and effort spent learning std::string is
already spent.  If I want to not use std::string::find, I can simply not
use it.

> Returning -1 instead of the end index is also pretty horrible.

Not sure I agree.

auto pos = some_long_expression().find('.');

// Clear, simple, obvious:
if (pos == std::string::npos) {
   ...
}

// Less clear, and I have to either evaluate the same expression twice
// or use an additional variable, possibly making an extra copy of the
// string in the process.
if (pos == some_long_expression().size()) {
}
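
A compilable sketch of the comparison above, under stated assumptions: some_long_expression() is a hypothetical stand-in that returns a std::string by value, and the iterator-based variant uses std::find, which is one possible end-returning interface and not necessarily the one Boost.Text provides.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for the expression in the message above.
inline std::string some_long_expression() { return "file.txt"; }

// Index-based check: the temporary string can be discarded right away,
// because npos does not depend on the string that was searched.
inline bool has_dot_npos()
{
    auto const pos = some_long_expression().find('.');
    return pos != std::string::npos;
}

// Iterator-based check: the result must be compared against end() of the
// same object, so the string has to be bound to a named variable first.
inline bool has_dot_iterator()
{
    std::string const s = some_long_expression();
    return std::find(s.begin(), s.end(), '.') != s.end();
}

inline void demo()
{
    assert(has_dot_npos());
    assert(has_dot_iterator());
}
```

Both functions give the same answer; the difference is only the extra named variable in the second one.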

> If convenience is so paramount, why don't we add member sort () to vector?

Because it would be inconvenient to change existing code from std::sort
to std::vector::sort, but also because my entire codebase contains only
8 calls to std::sort and at least two orders of magnitude as many calls
to std::string::[r]find.

For what it's worth, if I were back in the C++98 standards committee, I
would vote against the inclusion of std::string::find.  But that's not
the current situation.


--
Rainer Deyke ([hidden email])


Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On Sat, Oct 26, 2019 at 3:01 PM Rainer Deyke via Boost-users <[hidden email]> wrote:
> The issue isn't if your string is better than std::string.  The issue is
> if your string provides enough of an improvement to justify switching from
> std::string, after the time and effort spent learning std::string is
> already spent.  If I want to not use std::string::find, I can simply not
> use it.
>
>> Returning -1 instead of the end index is also pretty horrible.
>
> Not sure I agree.
>
> auto pos = some_long_expression().find('.');
>
> // Clear, simple, obvious:
> if (pos == std::string::npos) {
>    ...
> }
>
> // Less clear, and I have to either evaluate the same expression twice
> // or use an additional variable, possibly making an extra copy of the
> // string in the process.
> if (pos == some_long_expression().size()) {
> }
>
>> If convenience is so paramount, why don't we add member sort () to vector?
>
> Because it would be inconvenient to change existing code from std::sort
> to std::vector::sort, but also because my entire codebase contains only
> 8 calls to std::sort and at least two orders of magnitude as many calls
> to std::string::[r]find.
>
> For what it's worth, if I were back in the C++98 standards committee, I
> would vote against the inclusion of std::string::find.  But that's not
> the current situation.

Fair enough.  Like I said, the string stuff was originally added to explore what a std2::string might look like.  As of this writing, that's not really a thing that will happen.

Zach



Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
Le 26/10/2019 à 03:11, Zach Laine via Boost-users a écrit :

> About 14 months ago I posted the same thing.  There was significant work
> that needed to be done to Boost.Text (the proposed library), and I was a
> bit burned out.
>
> Now I've managed to make the necessary changes, and I feel the library
> is ready for review, if there is interest.
>
> This library, in part, is something I want to standardize.
>
> It started as a better string library for namespace "std2", with minimal
> Unicode support.  Though "std2" will almost certainly never happen now,
> those string types are still in there, and the library has grown to also
> include all the Unicode features most users will ever need.
>
> Github: https://github.com/tzlaine/text
> Online docs: https://tzlaine.github.io/text

I've read the intro on why std::string is so bad and I have to disagree
with many points.

1. The Fat Interface

In what way is std::string bloated? Of course some functions are probably
there as synonyms, but to say it's bloated is kinda false. Just look at
the numerous functions of Java's String instead [0].

And I

2. The Missing Unicode Support

Yes, many newcomers may be surprised to see that a string "é" has a size
of 2 bytes (assuming UTF-8). But it's also the case of UTF-16 strings
which may have surrogate pairs...

UTF-8 is the way to go and effectively stored. One could argue that we
should have some utf8 iterators or things like that. But std::string is
still a good candidate for string manipulations.
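
To make the size mismatch concrete, here is a small sketch of counting code points in a std::string that is assumed to hold well-formed UTF-8. The function count_code_points is a hypothetical helper for illustration, not the "utf8 iterators" of any particular library; it simply skips continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a string that is assumed to
// hold well-formed UTF-8, by skipping continuation bytes (10xxxxxx).
inline std::size_t count_code_points(std::string const & utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

inline void demo()
{
    // U+00E9 (é) encoded in UTF-8 is the two bytes 0xC3 0xA9.
    std::string const e_acute = "\xC3\xA9";
    assert(e_acute.size() == 2);             // size() counts bytes
    assert(count_code_points(e_acute) == 1); // one code point
}
```

The helper does no validation; malformed input would need the replacement-character handling a real Unicode library provides.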

3. Miscellaneous Limitations

Not being thread-safe is an issue? Thank god it is not. Imagine the
overhead of a threadsafe version of a string. The purpose of a library
is not to be threadsafe on every object. This has to be on the user side.

That said, I really hope for better Unicode support in std:: in the
near future. Your library is well designed and the API is clean; I hope it
can be added to Boost :-).

[0]: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html

Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On Mon, Oct 28, 2019 at 3:35 AM David Demelier via Boost-users <[hidden email]> wrote:
> I've read the intro on why std::string is so bad and I have to disagree
> with many points.
>
> 1. The Fat Interface
>
> In what way is std::string bloated? Of course some functions are probably
> there as synonyms, but to say it's bloated is kinda false. Just look at
> the numerous functions of Java's String instead [0].

Comparing std::string to Java's string class is not doing std::string any favors.
 
> And I
>
> 2. The Missing Unicode Support
>
> Yes, many newcomers may be surprised to see that a string "é" has a size
> of 2 bytes (assuming UTF-8). But it's also the case of UTF-16 strings
> which may have surrogate pairs...
>
> UTF-8 is the way to go and effectively stored. One could argue that we
> should have some utf8 iterators or things like that. But std::string is
> still a good candidate for string manipulations.

I agree that UTF-8 is the way to go (and as I think you've seen, the library reflects that).  However, UTF-8 encoding is only part of the story.  There is also normalization.  If you use UTF-8-in-std::strings, normalization will not be enforced.  (Neither will UTF-8 encoding, but that's less of a problem if you always intend to produce replacement characters for broken UTF-8.)  Most users will want a type that enforces normalization as a class invariant.  Those that do not have the tools -- the algorithms and iterators in the Unicode layer -- to do that in a std::string if they want.
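
A short sketch of why encoding alone is not enough, using only std::string and hand-written UTF-8 bytes. The two function names are hypothetical; the byte sequences are the standard UTF-8 encodings of U+00E9 and of "e" followed by U+0301.

```cpp
#include <cassert>
#include <string>

// "é" as one precomposed code point, U+00E9 (its NFC form).
inline std::string composed_e_acute() { return "\xC3\xA9"; }

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (its NFD form).
inline std::string decomposed_e_acute() { return "e\xCC\x81"; }

inline void demo()
{
    // Both are valid UTF-8 and canonically equivalent, but std::string
    // compares bytes, so the two spellings are unequal and differ in size.
    assert(composed_e_acute() != decomposed_e_acute());
    assert(composed_e_acute().size() == 2);
    assert(decomposed_e_acute().size() == 3);
}
```

A type that normalizes on construction would store both inputs as the same byte sequence, so equality would behave as users expect; std::string makes no such guarantee.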
 
> 3. Miscellaneous Limitations
>
> Not being thread-safe is an issue? Thank god it is not. Imagine the
> overhead of a threadsafe version of a string. The purpose of a library
> is not to be threadsafe on every object. This has to be on the user side.

I don't think all string types should be threadsafe, but having a threadsafe option is nice.  That was not an explicit goal of adding ropes, but it is a nice side-effect of the choice I made for how to implement the ropes in Boost.Text.
 
> That said, I really hope for better Unicode support in std:: in the
> near future. Your library is well designed and the API is clean; I hope it
> can be added to Boost :-).

Thanks, me too. :)

Zach 


Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 26.10.2019 03:11, Zach Laine via Boost-users wrote:

> About 14 months ago I posted the same thing.  There was significant
> work that needed to be done to Boost.Text (the proposed library), and
> I was a bit burned out.
>
> Now I've managed to make the necessary changes, and I feel the library
> is ready for review, if there is interest.
>
> This library, in part, is something I want to standardize.
>
> It started as a better string library for namespace "std2", with
> minimal Unicode support.  Though "std2" will almost certainly never
> happen now, those string types are still in there, and the library has
> grown to also include all the Unicode features most users will ever need.
>
> Github: https://github.com/tzlaine/text
> Online docs: https://tzlaine.github.io/text
>
> If you care about portable Unicode support, or even addressing the
> embarrassment of being the only major production language with next to
> no Unicode support, please have a look and provide feedback.

Putting the issue of standardization aside, I certainly would love to see
something like that included in Boost. After a quick read of your docs
(about an hour), I'm not sure I'm happy with all the choices you've made
(see some remarks below) but overall I see it as something I would use
in the future. As you wrote, Unicode is hard, even with a library like
this; nearly mission impossible without.

A few remarks, for what they're worth:

- I've never seen std::string and thread (un)safety as an issue

- the pattern if (x == npos) is now so common that it is imho important to
preserve it

- for the sake of completeness the normalization type used at the text
level ought to be a policy parameter; although I do understand your
arguments against it I think it should be there even at the cost of
different text types being inoperable without conversions

- at the text level I'm not sure I'm willing to cope with different
fundamental text types; I just want to use boost::text::text, pretty
much the same as I use std::string as an alias for a much more complex
class template; heck, even at the string layer I'd probably prefer
rope/contiguous concept to be a policy parameter to the same type template.

- views should be introduced as views and not mixed with rope/contiguous
fundamental types

Hats off for the excellent work, though!

Leon



Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On Tue, Oct 29, 2019 at 5:11 AM Leon Mlakar via Boost-users <[hidden email]> wrote:
> Putting the issue of standardization aside, I certainly would love to see
> something like that included in Boost. After a quick read of your docs
> (about an hour), I'm not sure I'm happy with all the choices you've made
> (see some remarks below) but overall I see it as something I would use
> in the future. As you wrote, Unicode is hard, even with a library like
> this; nearly mission impossible without.
>
> A few remarks, for what they're worth:
>
> - I've never seen std::string and thread (un)safety as an issue

Fair enough.  As stated previously in this thread, the threadsafety feature is a side effect that comes from the copy-on-write semantics of rope.  *That* is the reason rope is designed the way it is, not the threadsafety part.  It just happens that the threadsafety part comes for free when you do the copy-on-write part.
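
A toy sketch of how copy-on-write can give this kind of safety as a side effect. The class cow_string is hypothetical and is NOT Boost.Text's rope implementation; it only assumes the general idea that copies share an immutable buffer and that a modification installs a fresh buffer instead of writing to the shared one.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write string, for illustration only.
class cow_string
{
public:
    explicit cow_string(std::string s)
        : buf_(std::make_shared<std::string>(std::move(s))) {}

    std::string const & str() const { return *buf_; }

    // "Mutation" builds a new buffer; the old one is never written to,
    // so other copies (possibly used on other threads) keep a stable view.
    void append(std::string const & tail)
    {
        buf_ = std::make_shared<std::string>(*buf_ + tail);
    }

    bool shares_buffer_with(cow_string const & other) const
    {
        return buf_ == other.buf_;
    }

private:
    std::shared_ptr<std::string const> buf_; // shared, immutable contents
};

inline void demo()
{
    cow_string a("abc");
    cow_string b = a;                 // cheap copy, same buffer
    assert(a.shares_buffer_with(b));
    b.append("def");                  // b now owns a new buffer
    assert(a.str() == "abc");         // a is unaffected
    assert(b.str() == "abcdef");
    assert(!a.shares_buffer_with(b));
}
```

In this sketch, distinct copies can be used from different threads without locking because the shared contents are never modified; a single cow_string object modified from two threads at once would still need external synchronization.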
 
> - the pattern if (x == npos) is now so common that it is imho important to
> preserve it

The std::string/std::string_view API is the only place in the STL where the algorithms do not return the end of the half-open input range on failure.  That's really wonky.  I don't care about preserving it.
 
> - for the sake of completeness the normalization type used at the text
> level ought to be a policy parameter; although I do understand your
> arguments against it I think it should be there even at the cost of
> different text types being inoperable without conversions

I disagree.  Policy parameters are bad for reasoning.  If I see a text::text, as things currently stand, I know that it is stored as a contiguous array of UTF-8, and that it is normalized FCC.  If I add a template parameter to control the normalization, I change the invariants of the type.  Types with different invariants should have different names.  To do otherwise is a violation of the single responsibility principle.
 
> - at the text level I'm not sure I'm willing to cope with different
> fundamental text types; I just want to use boost::text::text, pretty
> much the same as I use std::string as an alias for a much more complex
> class template; heck, even at the string layer I'd probably prefer
> rope/contiguous concept to be a policy parameter to the same type template.

That would be like adding a template parameter to std::vector that makes it act like a std::deque for certain values of that parameter.  Changing the space and time complexity of a type by changing a template parameter is the wrong answer.
 
> - views should be introduced as views and not mixed with rope/contiguous
> fundamental types

That does not sound like what I want either, but I don't know what this refers to.  Could you be specific?

Zach



Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 30/10/2019 05:11, Zach Laine wrote:
>     - the pattern if (x == npos) is now so common that it is imho important to
>     preserve it
>
> The std::string/std::string_view API is the only place in the STL where
> the algorithms do not return the end of the half-open input range on
> failure.  That's really wonky.  I don't care about preserving it.

Returning end of range on failure is incredibly inconvenient (for the
consumer; granted it's usually more convenient for the algorithm
implementer), and I'd be happier if STL algorithms didn't do that either.

I see that as an unfortunate consequence of using generic iterators as
input parameters and return types, and not an otherwise desirable design
choice.

(ie. the STL algorithms do it because they couldn't do anything better.
string doesn't do it because it can do something better [since it knows
the iterator type and class, and can consequently choose to return
something other than an iterator].)

>     - for the sake of completeness the normalization type used at the text
>     level ought to be a policy parameter; although I do understand your
>     arguments against it I think it should be there even at the cost of
>     different text types being inoperable without conversions
>
>
> I disagree.  Policy parameters are bad for reasoning.  If I see a
> text::text, as things currently stand, I know that it is stored as a
> contiguous array of UTF-8, and that it is normalized FCC.  If I add a
> template parameter to control the normalization, I change the invariants
> of the type.  Types with different invariants should have different
> names.  To do otherwise is a violation of the single responsibility
> principle.

While I too dislike policy parameters as a general rule -- especially
defaulted policy parameters, since APIs have a tendency to only
implement one and not all (see: how many libraries use std::string
instead of being templated on std::basic_string, or use std::vector<T>
instead of being templated on an allocator)...

Technically speaking, a different policy parameter does form a different
type name and thus "types with different invariants should have
different names" is satisfied.
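
A sketch of what a normalization policy parameter could look like, to make the "different type name" observation concrete. Every name here (normalization, basic_text, nfd_text, fcc_text) is hypothetical and none of it is Boost.Text's actual API; the class also does no real normalization and only records which form it is assumed to hold.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical names for illustration; none of this is Boost.Text's API.
enum class normalization { nfc, nfd, fcc };

template <normalization N>
class basic_text
{
public:
    explicit basic_text(std::string utf8) : storage_(std::move(utf8)) {}
    static constexpr normalization form = N;
    std::string const & storage() const { return storage_; }

private:
    std::string storage_; // assumed to be already normalized to form N
};

using nfd_text = basic_text<normalization::nfd>;
using fcc_text = basic_text<normalization::fcc>;

// Different policy arguments yield different, non-convertible types.
static_assert(!std::is_same<nfd_text, fcc_text>::value,
              "each policy value names a distinct type");
static_assert(!std::is_convertible<nfd_text, fcc_text>::value,
              "mixing forms needs an explicit conversion");
```

The static_asserts hold, so the compiler does treat the two as separate types. The cost is that any function meant to accept both has to be a template over N, which is the tendency to support only the default described above.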

Re: Interest in a Unicode library for Boost?

Boost - Users mailing list


> On Oct 29, 2019, at 4:26 PM, Gavin Lambert via Boost-users <[hidden email]> wrote:

> Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.

“incredibly inconvenient”?

Is it possible that you are overstating your case?



Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 30/10/2019 15:03, Jon Kalb wrote:
>> On Oct 29, 2019, at 4:26 PM, Gavin Lambert wrote:
>
>> Returning end of range on failure is incredibly inconvenient (for the consumer; granted it's usually more convenient for the algorithm implementer), and I'd be happier if STL algorithms didn't do that either.
>
> “incredibly inconvenient”?
>
> Is it possible that you are overstating your case?

Granted most existing algorithms require making two calls to the
collection to get .begin() and .end(), which requires assigning the
collection to some lvalue -- and once you've done that, the
inconvenience is small (though "== list.end()" is still a bit ugly).

But once you start working with range-based rather than iterator-based
algorithms, it happens a lot more frequently that your collection is an
rvalue that you don't want to have to assign to an lvalue -- but you end
up having to do so just so that you can get its .end() to check for
failure.  Or you end up writing a helper method just so that you can
have a named parameter lvalue without cluttering the original source.

(Already cited in this thread was a similar example for string rvalues,
where npos was more convenient than end().  Granted strings are more
often rvalues than collections are, but the principle applies to both.)
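
To make the rvalue case concrete, here is a minimal sketch. The producer functions `make_names()` and `make_line()` are hypothetical names invented for this illustration; only `std::find`, `std::string::find` and `std::string::npos` are real library facilities.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical producers that return their results by value (rvalues).
std::vector<std::string> make_names() { return {"alice", "bob"}; }
std::string make_line() { return "key=value"; }

// Iterator-style failure check: the result must be compared against end(),
// so the temporary has to be bound to a named lvalue first.
bool has_name(std::string const & wanted) {
    auto const names = make_names();
    return std::find(names.begin(), names.end(), wanted) != names.end();
}

// Index-style failure check: npos is a constant that does not depend on
// the object, so the temporary can be queried in a single expression.
bool has_equals_sign() {
    return make_line().find('=') != std::string::npos;
}
```

The second function never names the temporary, which is the convenience being argued for; the first needs the extra `names` variable only so that `end()` refers to the same object as `begin()`.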


I'm sure many people have written helper methods to avoid having to
write "map.find(key) == map.end()" patterns repeatedly.

And for associative containers in particular, an interface based around
Optional or Outcome would be a lot more convenient than one based around
iterators.
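
A helper of the kind mentioned above might look roughly like this. It is a sketch under assumptions: `lookup` and `port_or_default` are hypothetical names, `std::optional` requires C++17, and the value is returned by copy purely to keep the example short.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical helper: returns a copy of the mapped value, or an empty
// optional when the key is absent. A pointer or reference wrapper would
// avoid the copy at the cost of a slightly noisier interface.
template<typename Map>
std::optional<typename Map::mapped_type>
lookup(Map const & m, typename Map::key_type const & key) {
    auto const it = m.find(key);
    if (it == m.end())
        return std::nullopt;
    return it->second;
}

// Usage: the "not found" case is handled without mentioning end().
int port_or_default(std::map<std::string, int> const & config) {
    return lookup(config, "port").value_or(8080);
}
```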

Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On Tue, Oct 29, 2019 at 6:26 PM Gavin Lambert via Boost-users <[hidden email]> wrote:
> On 30/10/2019 05:11, Zach Laine wrote:
> >     - pattern if (x == npos) is now so common that it is imho important to
> >     preserve it
> >
> > The std::string/std::string_view API is the only place in the STL where
> > the algorithms do not return the end of the half-open input range on
> > failure.  That's really wonky.  I don't care about preserving it.
>
> Returning end of range on failure is incredibly inconvenient (for the
> consumer; granted it's usually more convenient for the algorithm
> implementer), and I'd be happier if STL algorithms didn't do that either.
>
> I see that as an unfortunate consequence of using generic iterators as
> input parameters and return types, and not an otherwise desirable design
> choice.
>
> (ie. the STL algorithms do it because they couldn't do anything better.
> string doesn't do it because it can do something better [since it knows
> the iterator type and class, and can consequently choose to return
> something other than an iterator].)

I heartily disagree, but I'm also very curious about this.  As an example, could you take one of the simple std algorithms (std::find would be a very simple candidate), and show its definition in the style you have in mind?
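
As a concrete reference point for this question, here is one hypothetical shape such a definition could take. It is a sketch for illustration only, not something posted in the thread; the name `find_opt` and the use of `std::optional` (C++17) are assumptions.

```cpp
#include <optional>
#include <vector>

// Hypothetical alternative to std::find: an empty optional signals
// "not found" instead of returning the end iterator.
template<typename Iter, typename T>
std::optional<Iter> find_opt(Iter first, Iter last, T const & value) {
    for (; first != last; ++first) {
        if (*first == value)
            return first;
    }
    return std::nullopt;
}
```

A caller can test the result directly, as in `if (auto it = find_opt(v.begin(), v.end(), 2))`. The trade-off is that the result can no longer be passed straight to another algorithm as the end of a subrange, which is what the end-iterator convention allows.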
 
> >     - for the sake of completeness the normalization type used at the text
> >     level ought to be a policy parameter; although I do understand your
> >     arguments against it I think it should be there even at the cost of
> >     different text types not being interoperable without conversions
> >
> >
> > I disagree.  Policy parameters are bad for reasoning.  If I see a
> > text::text, as things currently stand, I know that it is stored as a
> > contiguous array of UTF-8, and that it is normalized FCC.  If I add a
> > template parameter to control the normalization, I change the invariants
> > of the type.  Types with different invariants should have different
> > names.  To do otherwise is a violation of the single responsibility
> > principle.
>
> While I too dislike policy parameters as a general rule -- especially
> defaulted policy parameters, since APIs have a tendency to only
> implement one and not all (see: how many libraries use std::string
> instead of being templated on std::basic_string, or use std::vector<T>
> instead of being templated on an allocator)...
>
> Technically speaking, a different policy parameter does form a different
> type name and thus "types with different invariants should have
> different names" is satisfied.

Yes, you got me.  I was speaking loosely, and referred to a template as if it were a type.  What I should have added was that a template's single responsibility should be to stamp out types that all model the same concept.  A policy-based template has a hard time doing that.  A policy-based template that stamps out strings with different invariants does not do that at all.

Zach



Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
I'm reposting this; by mistake I used the "Reply" button instead of "Reply To List". I apologize for the inconvenience.
 
> > - for the sake of completeness the normalization type used at the text
> > level ought to be a policy parameter; although I do understand your
> > arguments against it I think it should be there even at the cost of
> > different text types not being interoperable without conversions
>
> I disagree.  Policy parameters are bad for reasoning.  If I see a text::text, as things currently stand, I know that it is stored as a contiguous array of UTF-8, and that it is normalized FCC.  If I add a template parameter to control the normalization, I change the invariants of the type.  Types with different invariants should have different names.  To do otherwise is a violation of the single responsibility principle.

Okay, policy or no policy was not my point; the point was to allow for different underlying normalizations. Granted, it may only be important to (a few) corner cases where input and/or output normalizations are given, and your assessment that it may not be worth the effort is reasonable ... unless you are aiming towards adding to the standard. Then the completeness imho becomes more important.

Frankly, I'm not proficient enough in metaprogramming to make a strong case either for a policy parameter or for explicit types/templates. I just happen to prefer the policy-based approach.

> > - at the text level I'm not sure I'm willing to cope with different
> > fundamental text types; I just want to use boost::text::text, pretty
> > much the same as I use std::string as an alias for a much more complex
> > class template; heck, even at the string layer I'd probably prefer
> > rope/contiguous concept to be a policy parameter to the same type template.
>
> That would be like adding a template parameter to std::vector that makes it act like a std::deque for certain values of that parameter.  Changing the space and time complexity of a type by changing a template parameter is the wrong answer.

No, that is not making std::vector act as std::deque: the text would still remain text and act as text, with the same interface. It's more like a FIFO implementation using either std::vector or std::deque for its store. Since in both cases the FIFO has the same interface and functionally behaves the same, I really don't want two distinct types. A type template with a parameter that chooses the underlying storage seems much more natural to me.
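
The standard library already has an adaptor of this shape: `std::queue` takes its backing container as a template parameter. The sketch below uses it to illustrate the FIFO analogy; the aliases `deque_fifo` and `list_fifo` and the function `drain_sum` are hypothetical names. One caveat is that `std::queue` needs `pop_front()`, so `std::deque` and `std::list` qualify while `std::vector` does not.

```cpp
#include <deque>
#include <list>
#include <queue>

// Same FIFO interface, different backing store chosen by a template
// parameter.
using deque_fifo = std::queue<int, std::deque<int>>;
using list_fifo = std::queue<int, std::list<int>>;

// Written once against the common queue interface; works for either store.
template<typename Fifo>
int drain_sum(Fifo q) {
    int sum = 0;
    while (!q.empty()) {
        sum += q.front();
        q.pop();
    }
    return sum;
}
```

Here the interface is identical for every container argument, which is the situation described in the reply above. Whether a string-like and a rope-like text could share one interface in the same way is exactly the point under dispute.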
 
> > - views should be introduced as views and not mixed with rope/contiguous
> > fundamental types
>
> That does not sound like what I want either, but I don't know what this refers to.  Could you be specific?

Well, I'll have to think more about it ... it struck me that the docs often mention X and X_view in the same sentence, and you have to go elsewhere to learn that one is owning and the other isn't. I hope I'll find some time in the next few days and come back to this.

Cheers,

Leon




Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 10/30/19 1:03 AM, Zach Laine via Boost-users wrote:
>
> Yes, you got me.  I was speaking loosely, and referred to a template
> as if it were a type.  What I should have added was that a template's
> single responsibility should be to stamp out types that all model the
> same concept.  A policy-based template has a hard time doing that.  A
> policy-based template that stamps out strings with different
> invariants does not do that at all.
>
But all the various templates may well express a base concept even if
some of the invariants change between different template parameters. The
example of Unicode normalization, or memory allocator seem like perfect
examples of this. If an operation needs a particular normalization rule,
it specializes its parameter on that one case, otherwise it leaves it as
a template parameter.

To me, the basic invariant of a string is that it is a sequence of code
units that describe a textual object. Often the details of that encoding
are unimportant, it might be ASCII, it might be in some old code page,
it might be in UTF-8, it might be in UCS-4, and for the various Unicode
variations, there are different normalization rules to handle that a
given 'character' (aka Glyph) might be expressed in different ways, but
the routine largely doesn't care. When it does care, it can force the
string into that variant (or refuse some other variants), but the
purpose of templates is to condense duplicate code into a single piece
of code that only needs to be written once.
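
A minimal sketch of that arrangement follows. Everything in it is hypothetical: the tags `nfd_tag` and `fcc_tag`, the template `basic_text`, and both functions are invented for illustration and are not part of the proposed library.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical normalization tags and a policy-parameterized text type.
struct nfd_tag {};
struct fcc_tag {};

template<typename Normalization>
struct basic_text {
    std::string code_units; // assumed to hold UTF-8 in the tagged form
};

// Each policy value produces a distinct type.
static_assert(
    !std::is_same<basic_text<nfd_tag>, basic_text<fcc_tag>>::value,
    "different policies give different types");

// Does not care about the normalization: written once for every tag.
template<typename Normalization>
std::size_t size_in_code_units(basic_text<Normalization> const & t) {
    return t.code_units.size();
}

// Cares about the normalization: accepts only the FCC instantiation, so
// passing a basic_text<nfd_tag> is a compile-time error.
std::string const & fcc_code_units(basic_text<fcc_tag> const & t) {
    return t.code_units;
}
```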

--
Richard Damon


Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On 26.10.19 18:41, Zach Laine via Boost-users wrote:
> NFC, very close to FCC, is more popular, due to its compactness.  I picked
> the normalization form with the most readily available time and space
> optimizations, and then stuck to just that one -- the alternative is many
> text types with different normalizations having to interoperate, which
> sounds like hell.

I can understand that, all other things being equal, the more compact
form might be preferable.  I mean, if you know nothing about Unicode
normalization forms other than that one is more compact than the other,
then you might as well pick the more compact one, right?

But all other things are clearly /not/ equal, or you would just use NFC.
  And the difference in compactness between NFC and NFD is completely
trivial.  I challenge you to find any real-world text where the
difference in size between NFC and NFD is big enough that I should care
about it, both in absolute and relative terms.

I consider FCC a non-solution to a non-problem.  The advantage of NFC
over NFD is not compactness, but compatibility with interfaces that
expect NFC.  Since FCC does not provide that advantage, there is no
reason to choose FCC over NFD.  On the other hand, there are several
good reasons for choosing NFD over FCC.  Aside from the obvious one -
compatibility with interfaces that expect NFD - there's also cleaner,
simpler code with fewer surprises.  For example, it is a completely
straightforward operation to replace all acute accents in a NFD text
with grave accents or to remove acute accents entirely, whereas the FCC
equivalent requires effectively transcoding to NFD.
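
To illustrate the accent example: in NFD, "é" is the two code points U+0065 U+0301, so the accent can be replaced on its own. The sketch below assumes the input is already a sequence of NFD code points; `acute_to_grave` is a hypothetical name, and no normalization is performed by the code itself.

```cpp
#include <algorithm>
#include <string>

// Replace every COMBINING ACUTE ACCENT (U+0301) with COMBINING GRAVE
// ACCENT (U+0300). On NFD input each accent is its own code point, so a
// plain element-wise replace is sufficient.
std::u32string acute_to_grave(std::u32string s) {
    char32_t const acute = 0x0301;
    char32_t const grave = 0x0300;
    std::replace(s.begin(), s.end(), acute, grave);
    return s;
}
```

On a composed form the same function leaves the precomposed U+00E9 untouched, because there is no separate U+0301 to find. That is why the composed case needs decomposition first.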

In summary, I think you should support NFD text types.  Either in
addition to FCC or instead of it.


--
Rainer Deyke ([hidden email])


Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On Wed, Oct 30, 2019 at 6:55 AM Richard Damon via Boost-users <[hidden email]> wrote:
> On 10/30/19 1:03 AM, Zach Laine via Boost-users wrote:
> >
> > Yes, you got me.  I was speaking loosely, and referred to a template
> > as if it were a type.  What I should have added was that a template's
> > single responsibility should be to stamp out types that all model the
> > same concept.  A policy-based template has a hard time doing that.  A
> > policy-based template that stamps out strings with different
> > invariants does not do that at all.
> >
> But all the various templates may well express a base concept even if
> some of the invariants change between different template parameters. The
> example of Unicode normalization, or memory allocator seem like perfect
> examples of this. If an operation needs a particular normalization rule,
> it specializes its parameter on that one case, otherwise it leaves it as
> a template parameter.
>
> To me, the basic invariant of a string is that it is a sequence of code
> units that describe a textual object. Often the details of that encoding
> are unimportant, it might be ASCII, it might be in some old code page,
> it might be in UTF-8, it might be in UCS-4, and for the various Unicode
> variations, there are different normalization rules to handle that a
> given 'character' (aka Glyph) might be expressed in different ways, but
> the routine largely doesn't care. When it does care, it can force the
> string into that variant (or refuse some other variants), but the
> purpose of templates is to condense duplicate code into a single piece
> of code that only needs to be written once.

You're mixing kinds of abstractions here.  There is the genericity you find in a function that takes a generic parameter, and that's the kind of use based on concept you're talking about here.  About that you're 100% correct:

template<foo_concept T>
auto foo(T const & x); // <-- feel free to pass any type here that models foo_concept

Part of why the above code works is that foo() only uses x in certain ways, and anything that meets the syntactic requirements is well-formed.  If foo_concept describes a sequence container, I only care about the common interface of a sequence container.  Specifically, I cannot use vector::reserve(), and don't really care that it exists.

Where that breaks down is when you have not a function template that uses certain aspects of a type, but a class template that represents a set of types.  That case is different:

foo_template<T> foo; // <-- feel free to use the entire API

If the API is different for various values of T, such as it would be for a text template that instantiates as string-like or rope-like (because those have significantly different interfaces), that implies to me that I should have two names in play -- one for the string version and one for the rope version.  Otherwise, the result is super confusing for someone reading or writing code using the unified name.
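
A rough sketch of the problem, under stated assumptions: `text_t`, `contiguous_tag` and `rope_tag` are hypothetical names, and the rope is stood in for by a `std::deque` of segments rather than a real tree-based rope.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical storage tags for a single policy-based text template.
struct contiguous_tag {};
struct rope_tag {};

template<typename Storage>
class text_t; // primary template, specialized below

// Contiguous specialization: can expose a pointer to all its code units.
template<>
class text_t<contiguous_tag> {
public:
    explicit text_t(std::string s) : storage_(std::move(s)) {}
    char const * data() const { return storage_.data(); }
    std::size_t size() const { return storage_.size(); }
private:
    std::string storage_;
};

// Rope-like specialization: segments are not contiguous, so there is no
// data(); it offers append() and segment_count() instead.
template<>
class text_t<rope_tag> {
public:
    void append(std::string segment) { segments_.push_back(std::move(segment)); }
    std::size_t segment_count() const { return segments_.size(); }
    std::size_t size() const {
        std::size_t total = 0;
        for (auto const & segment : segments_)
            total += segment.size();
        return total;
    }
private:
    std::deque<std::string> segments_;
};
```

Both instantiations share the name `text_t`, yet only `size()` is common to them. Code written against `text_t<S>` for an unknown `S` cannot call `data()` or `append()` without knowing which specialization it has, which is the confusion described above.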

Zach


Re: Interest in a Unicode library for Boost?

Boost - Users mailing list
On Wed, Oct 30, 2019 at 8:03 AM Rainer Deyke via Boost-users <[hidden email]> wrote:
> On 26.10.19 18:41, Zach Laine via Boost-users wrote:
> > NFC, very close to FCC, is more popular, due to its compactness.  I picked
> > the normalization form with the most readily available time and space
> > optimizations, and then stuck to just that one -- the alternative is many
> > text types with different normalizations having to interoperate, which
> > sounds like hell.
>
> I can understand that, all other things being equal, the more compact
> form might be preferable.  I mean, if you know nothing about Unicode
> normalization forms other than that one is more compact than the other,
> then you might as well pick the more compact one, right?
>
> But all other things are clearly /not/ equal, or you would just use NFC.
> And the difference in compactness between NFC and NFD is completely
> trivial.  I challenge you to find any real-world text where the
> difference in size between NFC and NFD is big enough that I should care
> about it, both in absolute and relative terms.
>
> I consider FCC a non-solution to a non-problem.  The advantage of NFC
> over NFD is not compactness, but compatibility with interfaces that
> expect NFC.  Since FCC does not provide that advantage, there is no
> reason to choose FCC over NFD.  On the other hand, there are several
> good reasons for choosing NFD over FCC.  Aside from the obvious one -
> compatibility with interfaces that expect NFD - there's also cleaner,
> simpler code with fewer surprises.  For example, it is a completely
> straightforward operation to replace all acute accents in a NFD text
> with grave accents or to remove acute accents entirely, whereas the FCC
> equivalent requires effectively transcoding to NFD.
>
> In summary, I think you should support NFD text types.  Either in
> addition to FCC or instead of it.

NFD is not an unreasonable choice, though I don't know why you'd want to do a search-replace that changes all the accents from acute to grave (is that a real use case, or just a for-instance?).  Unfortunately, the fast path of the collation algorithm implementation requires FCC, which is why ICU uses it, and one of the main reasons why I picked it.  If we had NFD strings, we'd have to normalize them to FCC first, if I'm not mistaken.  (Though I should verify that with a test.)

Zach
