[review] [text] Text formal review

Boost - Dev mailing list
Dear Boost community,

The formal review of Zach Laine's Unicode library, Text, begins today
(June 11th) and ends on June 20th.

The library provides three layers:
 - The string layer, a set of types that constitute "a better std::string"
 - The Unicode layer, consisting of the Unicode algorithms and data
 - The text layer, a set of types like the string layer types, but
providing transparent Unicode support

Code: https://github.com/tzlaine/text
Docs: https://tzlaine.github.io/text

The master branch of the repository will be frozen for the duration of
the review.

Text requires C++14 or higher, and supports GCC 5+, Clang 3.6+, and MSVC 2017.

Zach initially presented the library at C++Now in 2018. He also
intends to standardize a library solution for Unicode based on this
work.

    Talk: https://www.youtube.com/watch?v=944GjKxwMBo
    Slides: https://tinyurl.com/boost-text-slides

We encourage your participation in this review. At a minimum, kindly state:
- Whether you believe the library should be accepted into Boost
  * Any conditions for acceptance
- Your name and knowledge of the problem domain.

You are strongly encouraged to also provide additional information:
- What is your evaluation of the library's:
  * Design
  * Implementation
  * Documentation
  * Tests
  * Usefulness
- Did you attempt to use the library? If so:
  * Which compiler(s)
  * What was the experience? Any problems?
- How much effort did you put into your evaluation?

Even if you are not submitting a formal review, your feedback on the
library is welcome.

Glen


Re: [review] [text] Text formal review

Boost - Dev mailing list
On 11/06/2020 19:37, Glen Fernandes via Boost wrote:

> The library provides three layers:
>  - The string layer, a set of types that constitute "a better std::string"
>  - The Unicode layer, consisting of the Unicode algorithms and data
>  - The text layer, a set of types like the string layer types, but
> providing transparent Unicode support

Firstly, I'd like to say that proposing a new string implementation is
probably one of the most masochistic things that you can do in C++. Even
more than proposing a result<T, E> type. So, I take a bow to you Mr.
Laine, and I salute your bravery.

I'll put aside the Unicode and Text layers for now, and just consider
the String layer. I have to admit, I'm not keen on the string layer.
Firstly I hate the naming. Everything there ought to get more
descriptive naming. But more importantly, how the design of the string
implementation has been broken up and designed, there's just something
about it which doesn't sit right with me. You seem to me to have
prioritised certain use cases over others which would not be my own
choices i.e. I don't think the balance of tradeoffs is right in there.
For example, I wouldn't have chosen an atomically reference counted rope
design the way you have at all: I'd have gone for a fusion of your
static string builder with constexpr-possible (i.e. non-atomic)
reference counted fragments, using expression templates to lazily
canonicalise the string depending on sink (e.g. if the sink is cout, use
gather i/o sequence instead of creating a new string). That sort of thing.

Zach, could you take this opportunity to compare your choice of string
design with the string designs implemented by each of the following
libraries please?

- LLVM strings, string refs, twines etc.

- Abseil's strings, string pieces.

- Folly's strings, string pieces and ranges.

- CopperSpice's CsString.

I feel like I am forgetting at least another two. But, point is, I'd
like to know why you chose a different design to each of the above, in
those situations where you did so.


I'll nail my own colours to the mast on this topic: I've thought about
this long and hard over many, many years, and I've personally arrived at
the opinion that C needs to gain an integral string object built into
the language, which builds on top of an integral variably sized array
object (NOT current C VLAs). Said same built-in string object would also
be available to C++, by definition.

I have arrived at this opinion because I don't think that ANY library
solution can have the right balance of tradeoffs between all the
competing factors. I think that only a built-in object to the language
itself can deliver the perfect string object, because only the compiler
can deliver a balance of optimisability with developer convenience.

I won't go into any more detail, as this is a review of the Text C++
library. And I know I've already discussed my opinion on SG16 where you
Zach were present, so you've heard all my thoughts on this already.
However, if you were feeling keen, I'd like to know if you could think
of any areas where language changes would aid implementing better
strings in C++?

Niall


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 6:40 AM Niall Douglas via Boost
<[hidden email]> wrote:
> Zach, could you take this opportunity to compare your choice of string
> design with the string designs implemented by each of the following
> libraries please?
>
> - LLVM strings, string refs, twines etc.
> - Abseil's strings, string pieces.
> - Folly's strings, string pieces and ranges.
> - CopperSpice's CsString.

There's also boost::json::string, which resembles std::string except that:

* It is not a class template.

* Most of the function definitions are in a library TU rather than a header.

* It uses boost::json::storage_ptr instead of Allocator or
std::pmr::memory_resource*

* Some function signatures have been collapsed or adjusted for
simplicity, in ways
  that would be desirable for std::string except that it would break ABI.

json::string:

<https://github.com/CPPAlliance/json/blob/f47fbeb1a54b832221a35c17912febb7a2c3a317/include/boost/json/string.hpp#L57>

json::storage_ptr

<http://vinniefalco.github.io/doc/json/json/usage/allocators.html>
<http://vinniefalco.github.io/doc/json/json/ref/boost__json__storage_ptr.html>

Regards


Re: [review] [text] Text formal review

Boost - Dev mailing list
My name is Rainer Deyke.  My experience of the problem domain is mostly
as a user of other unicode libraries, ICU in particular.  Here is my
formal review of Boost.Text.

> - What is your evaluation of the library's:
>  * Design

String layer: overall, a solution looking for a problem.

The use of 'int' as the size_type in string/string_view is a time
bomb waiting to go off.  I don't /typically/ have strings longer than
2GB, but I do use some text-based file formats for which there is no
/inherent/ reason why the individual files can't exceed 2GB.  (And no,
unencoded_rope would not be a better choice.  I can't memmap ropes, but
I can memmap string_views.)

I like the use of negative indices for indexing from the end in
Python, but I am ambivalent about using the same feature in C++.  None
of the other types I regularly use in C++ work like that, and the
runtime cost involved is a lot more noticeable in C++.  Also, using the
same type of counting-from-end indices and counting-from-beginning
indices seems unsafe.  A separate type for counting-from-end would be
safer and faster, at the cost of being more syntactically heavy.

My first impression for unencoded_rope_view was that it's useless.
Then I saw the undo/redo example for segmented_vector, and I have since
amended my opinion to "mostly useless".  To clarify:
   - In the most common use for strings, where the string is created
once, used, and discarded without ever being modified, unencoded_rope is
worse than a contiguous string: slower lookup, slower traversal, and
worse data locality.
   - The most common forms of "modification", in my experience, are
most efficiently done by reconstructing the entire string, which again
makes unencoded_rope useless.  (And concatenation, I guess, for which
unencoded_rope is actually great.)
   - Even if my modifications are primarily of the type for which
unencoded_rope is optimized, there would need to be a fairly high
modification-to-read-operation ratio for unencoded_rope to be a net win.
   - But the fact that copies on unencoded_ropes are cheap, and the
fact that these copies remain cheap when one copy is modified, makes
unencoded_rope very useful for implementing undo/redo in a text
editor operating on very large text files.

Unicode layer: overall, very useful.

One thing that appears to be missing is normalization-preserving
append/insert/erase operations.  These are available in the text layer,
but that means being tied to the specific text classes provided by that
layer.

Requiring unicode text to be in Stream-Safe Format is another time bomb
waiting to go off, but it's also a usability issue.  The library should
provide an algorithm to put unicode text in Stream-Safe Format, and
should automatically apply that algorithm whenever text is normalized.
This would make it safe to use Boost.Text on data from an untrusted
source so long as the data is normalized first, which you have to do
with untrusted data anyway.

Text layer: overall, I don't like it.

On one hand, there is the gratuitous restriction to FCC.  Why can't
other normalization forms be supported, given that the unicode layer
supports them?

On the other hand, there is the restriction to the base types from the
string layer.  Why can't 'text' be a class template taking the
underlying container class as an argument?  (Including std::string?)

I have a lot of code, much of it in third-party libraries (such as
Boost!), that uses std::string as a basic vocabulary type.  Converting
between Boost.Text strings and std::strings at all API boundaries is not
viable.  If I can't use the text layer with std::string, then I don't
want it.

> * Implementation

I didn't look at it in detail.

> * Documentation

Overall, very clear and useful.

The warning about undefined behavior if the data is not in Stream-Safe
Format should be at the top of every page to which it applies, in big
and bold text, on a red background, ideally blinking.  Or, the library
could take measures to make it reasonably safe for handling untrusted
data.

.../the_unicode_layer/collation.html: Unicode Technical Note #5
describes an /extension/ to the unicode collation algorithm which /also/
allows it to handle FCC /in addition to/ NFD.  If your collation
algorithm doesn't handle NFD correctly, then there is something
seriously wrong with it, and I expect that it will also fail on the
corner cases of FCC.

Describing NFD as "memory-hogging" is FUD, as the actual difference in
memory usage between NFD and FCC is trivial for most real-world data.
(I suppose Korean Hangul would be a major exception to this.)

.../the_unicode_layer/searching.html: the note at the end of the
page is wrong, assuming you implemented the algorithms correctly.  The
concerns for searching NFD strings are similar to the concerns for
searching FCC strings.

In both FCC and NFD:
   - There is a distinction between A+grave+acute and A+acute+grave,
because they are not canonically equivalent.
   - A+grave is a partial grapheme match for A+grave+acute.
   - A+acute is not a partial grapheme match for A+grave+acute.
   - A+grave is not a partial grapheme match for A+acute+grave.
   - A+acute is a partial grapheme match for A+acute+grave.
But:
   - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.

> * Tests

I found it hard to get a good overview of the tests, seeing as a lot of
them are dynamically generated test files.

One thing I noticed is that there are extensive normalization tests for
the four basic unicode normalization forms, but none for FCC.

In the text_.cpp test file, there are some raw unicode code points
without either the actual character or its name, making it hard to see
what's going on.

In the text_.cpp test file, in the normalization tests, there are some
edge cases that do not seem to be tested:
   - Erasing a combining character such that another combining character
merges into the previous base character.  (Example: removing the cedilla
in an A+cedilla+ring_above sequence, yielding the precomposed
A_with_ring_above character.)
   - Inserting a combining character such that this causes a precomposed
character to decompose.  (Example: inserting a cedilla after
A_with_ring_above, yielding an A+cedilla+ring_above.)
   - Inserting a combining character that requires existing combining
characters to be reordered.

"empty" is misspelled as "emtpy" in text_.cpp.


> * Usefulness - Did you attempt to use the library?

I was not able to get the library to build, so I was not able to test
it.  But it does look like it should be a good ICU replacement for my
purposes, assuming it doesn't have any serious bugs.

My vote: ACCEPT the unicode layer, but only the unicode layer.  ABSTAIN
for the other layers, with the caveat that if they are accepted, the
issues I have raised should be addressed.


--
Rainer Deyke ([hidden email])



Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 8:40 AM Niall Douglas via Boost
<[hidden email]> wrote:

>
> On 11/06/2020 19:37, Glen Fernandes via Boost wrote:
>
> > The library provides three layers:
> >  - The string layer, a set of types that constitute "a better std::string"
> >  - The Unicode layer, consisting of the Unicode algorithms and data
> >  - The text layer, a set of types like the string layer types, but
> > providing transparent Unicode support
>
> Firstly, I'd like to say that proposing a new string implementation is
> probably one of the most masochistic things that you can do in C++. Even
> more than proposing a result<T, E> type. So, I take a bow to you Mr.
> Laine, and I salute your bravery.
>
> I'll put aside the Unicode and Text layers for now, and just consider
> the String layer. I have to admit, I'm not keen on the string layer.
> Firstly I hate the naming. Everything there ought to get more
> descriptive naming. But more importantly, how the design of the string
> implementation has been broken up and designed, there's just something
> about it which doesn't sit right with me. You seem to me to have
> prioritised certain use cases over others which would not be my own
> choices i.e. I don't think the balance of tradeoffs is right in there.
> For example, I wouldn't have chosen an atomically reference counted rope
> design the way you have at all: I'd have gone for a fusion of your
> static string builder with constexpr-possible (i.e. non-atomic)
> reference counted fragments, using expression templates to lazily
> canonicalise the string depending on sink (e.g. if the sink is cout, use
> gather i/o sequence instead of creating a new string). That sort of thing.
>
> Zach, could you take this opportunity to compare your choice of string
> design with the string designs implemented by each of the following
> libraries please?
>
> - LLVM strings, string refs, twines etc.
>
> - Abseil's strings, string pieces.
>
> - Folly's strings, string pieces and ranges.
>
> - CopperSpice's CsString.
>
> I feel like I am forgetting at least another two. But, point is, I'd
> like to know why you chose a different design to each of the above, in
> those situations where you did so.

No, because I don't honestly care about the string layer that much any
more.  It was originally a major reason -- the reason, really -- for
the library at the outset.  Now it's mostly cruft.  If people object
to it enough (and it seems they will), I can certainly remove it
entirely, except for unencoded_rope, which is needed in rope.
Replacing boost::text::string with std::string within
boost::text::text is straightforward, and will have no visible effect
on uses of text::text, except for extract() and replace().  The only
reason I left the string bits of the library in place when I changed
the focus to be Unicode-centric is that it was less work to do so.

> I'll nail my own colours to the mast on this topic: I've thought about
> this long and hard over many many years, and I've personally arrived on
> the opinion that C needs to gain an integral string object built into
> the language, which builds on top of an integral variably sized array
> object (NOT current C VLAs). Said same built-in string object would also
> be available to C++, by definition.
>
> I have arrived at this opinion because I don't think that ANY library
> solution can have the right balance of tradeoffs between all the
> competing factors. I think that only a built-in object to the language
> itself can deliver the perfect string object, because only the compiler
> can deliver a balance of optimisability with developer convenience.
>
> I won't go into any more detail, as this is a review of the Text C++
> library. And I know I've already discussed my opinion on SG16 where you
> Zach were present, so you've heard all my thoughts on this already.
> However, if you were feeling keen, I'd like to know if you could think
> of any areas where language changes would aid implementing better
> strings in C++?

I think the big thing for me would be to have language-level support
for discriminating between char * strings and string literals.  String
literals are special in certain ways that are useful to take advantage
of:  1) they do not need to be copied, since they're in ROM; 2) they
are encoded by the compiler into the execution encoding used in phase
5 of translation.  This second one is pretty important to detect in
some cases, like making a printf-like upgrade to std::format() "just
work".

Zach


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 12:36 PM Rainer Deyke via Boost
<[hidden email]> wrote:
>
> My name is Rainer Deyke.  My experience of the problem domain is mostly
> as a user of other unicode libraries, ICU in particular.  Here is my
> formal review of Boost.Text.
>
> > - What is your evaluation of the library's:
> >  * Design
>
> String layer: overall, a solution looking for a problem.

The problem this library was designed to solve was the problem that I
hate std::string. :)  It solves this problem for me quite nicely by
being a minimal alternative to same.  That being said, as I mentioned
earlier in this thread, the string layer (except for unencoded_rope)
can all just go away if that is reviewers' preference.

> The use of 'int' as the size_type in string/string_view is a time
> bomb waiting to go off.  I don't /typically/ have strings longer than
> 2GB, but I do use some text-based file formats for which there is no
> /inherent/ reason why the individual files can't exceed 2GB.

There is also no inherent reason why max_size() is less than SIZE_MAX,
and yet it is on every implementation popular today.  To be clear, my
point is that typical use is going to be well within either of these
for mutable string data.  Anything atypical can use something else.

> (And no,
> unencoded_rope would not be a better choice.  I can't memmap ropes, but
> I can memmap string_views.)

You can easily memmap string_views 2GB at a time, though.  That is a
simple workaround for this corner case you have mentioned, but it is
truly a corner case compared to the much more common case of using
strings for holding contiguous sequences of char.  Contiguous
sequences of char really should not be anywhere near 2GB, for
efficiency reasons.
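
A rough POSIX sketch of the windowing I mean (page-alignment of the
offset, error reporting, and munmap() of the previous window are all
glossed over here):

    #include <sys/mman.h>
    #include <cstddef>
    #include <string_view>

    // Map one read-only window of a huge file and view it as chars.
    std::string_view map_window(int fd, off_t offset, std::size_t length)
    {
        void * p = ::mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, offset);
        if (p == MAP_FAILED)
            return {};  // real code would inspect errno
        return std::string_view(static_cast<char const *>(p), length);
    }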

> I like the use of negative indices for indexing from the end in
> Python, but I am ambivalent about using the same feature in C++.  None
> of the other types I regularly use in C++ work like that, and the
> runtime cost involved is a lot more noticeable in C++.

I don't really get what you mean about the runtime cost.  Could you be
more explicit?

> Also, using the
> same type of counting-from-end indices and counting-from-beginning
> indices seems unsafe.  A separate type for counting-from-end would be
> safer and faster, at the cost of being more syntactically heavy.

Interesting.  Do you mean something like a strong typedef for "index"
and "negative-index", or something else?  Some example code might be
instructive.

> My first impression for unencoded_rope_view was that it's useless.
> Then I saw the undo/redo example for segmented_vector, and I have since
> amended my opinion to "mostly useless".  To clarify:
>    - In the most common use for strings, where the string is created
> once, used, and discarded without ever being modified, unencoded_rope is
> worse than a contiguous string: slower lookup, slower traversal, and
> worse data locality.

Correct.  That's not its use case.

>    - The most common forms of "modification", in my experience, are
> most efficiently done by reconstructing the entire string, which again
> makes unencoded_rope useless.  (And concatenation, I guess, for which
> unencoded_rope actually great.)

Well, that really depends on the length of the string, the number and
kind of future edits, etc.

>    - Even if my modifications are primarily of the type for which
> unencoded_rope is optimized, there would need to be a fairly high
> modification-to-read-operation ratio for unencoded_string to be a net win.

Agreed.  If you use it in cases for which it is designed, it works
better than if you don't.  I'm not trying to be snarky; my point is
that it is a niche type -- it is useful anywhere you would like COW
semantics, or anywhere you would like to edit very long strings -- and
you should not use it outside of its niche.  How long "very long" is
is best revealed by measuring performance.  For most repeated editing
cases, I expect it to be far shorter than 2GB -- the limit of
text::string.

>    - But the fact that copies on unencoded_ropes are cheap, and the
> fact that these copies remain cheap when one copy is modified, makes
> unencoded_rope very useful for implementing undo/redo in a text
> editor operating on very large text files.

Exactly.  That is only one of its two primary use cases, though, as I
indicated above.  It also has an even more niche case, in that it is
easily used in a threadsafe manner.  My point is that it has its uses,
even if you don't typically need them yourself.
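
To make that concrete, here is a minimal sketch of the undo/redo pattern
that cheap copies enable (Rope stands in for any cheap-to-copy,
string-like type such as unencoded_rope; none of this is the library's
API):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Snapshot-based undo/redo: with a COW rope each snapshot is cheap, so
    // keeping the full history of a large document stays affordable.
    template <typename Rope>
    struct document
    {
        Rope const & current() const { return history_[index_]; }

        template <typename Edit>
        void apply(Edit edit)
        {
            history_.resize(index_ + 1);   // drop any pending redo states
            Rope next = history_[index_];  // cheap copy
            edit(next);                    // mutate only the copy
            history_.push_back(std::move(next));
            ++index_;
        }

        void undo() { if (index_ > 0) --index_; }
        void redo() { if (index_ + 1 < history_.size()) ++index_; }

        std::vector<Rope> history_{Rope()};
        std::size_t index_ = 0;
    };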

> Unicode layer: overall, very useful.
>
> One things that appears to be missing is a normalization-preserving
> append/insert/erase operators.  These are available in the text layer,
> but that means being tied to the specific text classes provided by that
> layer.

Hm.  I had not considered making these available as algorithms, and I
generally like the approach.  But could you be more specific?  In
particular, do you mean that insert() would take a container C and do
C.insert(), then renormalize?  This is the approach used by C++ 20's
erase() and erase_if() free functions.  Or did you mean something
else?

> Requiring unicode text to be in Stream-Safe Format is another time bomb
> waiting to go off, but it's also usability issue.  The library should
> provide an algorithm to put unicode text in Stream-Safe Format, and
> should automatically apply that algorithm whenever text is normalized.
> This would make it safe to use Boost.Text on data from an untrusted
> source so long as the data is normalized first, which you have to do
> with untrusted data anyway.

This seems like a good idea to me.  I went back and forth over whether
or not to supply the SSF algorithm, since it's not an official Unicode
algorithm, but adding it to text's normalization step would be reason
enough to do so.
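
For reference, the Stream-Safe Text Process (UAX #15) is short.  A
simplified code-point-level sketch -- the real process counts
non-starters in each character's canonical decomposition, and
combining_class() here is just a stand-in for the library's Unicode data
lookup:

    #include <vector>

    int combining_class(char32_t cp);  // stand-in; 0 means starter

    // Insert U+034F COMBINING GRAPHEME JOINER so that no run of
    // non-starters exceeds 30 code points.
    std::vector<char32_t> stream_safe(std::vector<char32_t> const & in)
    {
        std::vector<char32_t> out;
        int nonstarters = 0;
        for (char32_t cp : in) {
            if (combining_class(cp) == 0) {
                nonstarters = 0;
            } else if (++nonstarters > 30) {
                out.push_back(U'\u034F');  // CGJ starts a fresh run
                nonstarters = 1;
            }
            out.push_back(cp);
        }
        return out;
    }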

> Text layer: overall, I don't like it.
>
> On one hand, there is the gratuitous restriction to FCC.  Why can't
> other normalization forms be supported, given that the unicode layer
> supports them?

Here is the philosophy:  If you have a template parameter for
UTF-encoding and/or normalization form, you have an interoperability
problem in your code.  Multiple text<...>'s may exist for which there
is no convenient or efficient interop story.  If you instead convert
all of your text to one UTF+normalization that you use throughout your
code, you can do all your work in that one scheme and transcode and/or
renormalize at the program input/output boundaries.

Practically, the world seems to have standardized on UTF-8, NFC (or is
doing so rapidly).  I'm not thrilled that I have to use FCC instead of
NFC, but it's close.  The reason for using FCC is that it removes the
need to normalize to NFD before doing collation.  Collation is
expensive enough without adding memory allocation and NFD
normalization to it.  I'm experimenting with adding extra tailorings
to Boost.Text's collation tables to make collation compatible with
NFC.  I'm not yet sure if that can be made to work.

> On the other hand, there is the restriction to the base types from the
> string layer.  Why can't 'text' be a class template taking the
> underlying container class as an argument?  (Including std::string?)

Because, again, I want there to be trivial interop.  Having
text<text::string> and text<std::string> serves what purpose exactly?
That is, I have never seen a compelling use case for needing both at
the same time.  I'm open to persuasion, of course.

> I have a lot of code, much of it in third-party libraries (such as
> Boost!), that uses std::string as a basic vocabulary type.  Converting
> between Boost.Text strings and std::strings at all API boundaries is not
> viable.  If I can't use the text layer with std::string, then I don't
> want it.

I understand.  See my previous comments about my lack of preciousness
about text::string. :)

> > * Implementation
>
> I didn't look at it in detail.
>
> > * Documentation
>
> Overall, very clear and useful.
>
> The warning about undefined behavior if the data is not in Stream-Safe
> Format should be at the top of every page to which it applies, in big
> and bold text, on a red background, ideally blinking.  Or, the library
> could take measures to handle it reasonably safe for handling untrusted
> data.

The SSF assumption is explicitly allowed in the Unicode standard, and
it's less onerous than not checking array-bounds access in operator[]
in one's array-like types.  Buffer overflows are really common, and
SSF violations are not.  That being said, I can add the
SSF-conformance algorithm as mentioned above.

> .../the_unicode_layer/collation.html: Unicode Technical Note #5
> describes an /extension/ to the unicode collation algorithm which /also/
> allows it to handle FCC /in addition to/ NFD.  If your collation
> algorithm doesn't handle NFD correctly, then there is something
> seriously wrong with it, and I except that it will also fail on the
> corner cases of FCC.

It does not fail for FCC, but does fail for NFC.  I honestly have not
tried it with NFD yet.  I will do so.

> Describing NFD as "memory-hogging" is FUD, as the actual difference in
> memory usage between NFD and FCC is trivial for most real-world data.
> (I suppose Korean Hangul would be a major exception to this.)

Ok.  I can certainly remove that from the docs.

> .../the_unicode_layer/searching.html: the note at the end of the
> page is wrong, assuming you implemented the algorithms correctly.  The
> concerns for searching NFD strings are similar to the concerns for
> searching FCC strings.
>
> In both FCC and NFD:
>    - There is a distinction between A+grave+acute and A+acute+grave,
> because they are not canonically equivalent.
>    - A+grave is a partial grapheme match for A+grave+acute.
>    - A+acute is not a partial grapheme match for A+grave+acute.
>    - A+grave is not a partial grapheme match for A+acute+grave.
>    - A+acute is a partial grapheme match for A+acute+grave.
> But:
>    - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.

Hm.  That note was added because the specific case mentioned fails to
work for NFD, but works for FCC and NFC.  I think the grapheme
matching analysis above addresses something different from what the
note is talking about -- the note is concerned with code-point-level
searching results producing (perhaps) surprising results.  The
grapheme-based search does not, so which CPs are matched by a
particular grapheme does not seem to be relevant.  Perhaps I'm missing
something?

> > * Tests
>
> I found it hard to get a good overview of the tests, seeing as a lot of
> them are dynamically generated test files.

Yes, and it's difficult for me, too.  There are a lot of files in
there.  As Emil suggested on another thread, I'm going to partition
the tests into generated and non-.  Unfortunately, that will happen
after the review.

> One thing I noticed is that there are extensive normalization tests for
> the four basic unicode normalization forms, but none for FCC.

Yeah, that's because they don't exist, as far as I know.  There are no
such tests in ICU or on the Unicode website that I can find.  That
being said, since everything related to collation and the Text layer
uses FCC, and it all works, I think that sufficient coverage exists.
If it makes you feel any more sure, the algorithm I use for
normalization is just an adaptation of ICU's code, and the NFC and FCC
code paths differ in only a couple of places, controlled by a bool
passed to the normalization algorithm -- it's 99.9% the same code.

> In the text_.cpp test file, there are some raw unicode code points
> without either the actual character or its name, making it hard to see
> what's going on.

That was to make it clear to me which code point is being used. :)  To
each his own, I guess.

> In the text_.cpp test file, in the normalization tests, there are some
> edge cases that do not seem to be tested:
>    - Erasing a combining character such that another combining character
> merges into the previous base character.  (Example: removing the cedilla
> in a A+cedilla+ring_above sequence, yielding the precomposed
> A_with_ring_above character.)
>    - Inserting a combining character such that this causes a precomposed
> character to decompose.  (Example: inserting a cedilla after
> A_with_ring_above, yielding a A+cedilla+ring_above.)
>    - Inserting a combining character that requires existing combining
> characters to be reordered.

Thanks!  I'll add those.

> "empty" is misspelled as "emtpy" in text_.cpp.

Ah, thanks.

> > * Usefulness - Did you attempt to use the library?
>
> I was not able to get the library to build, so I was not able to test
> it.  But it does look like it should be a good ICU replacement for my
> purposes, assuming it doesn't have any serious bugs.

Huh.  What was the build problem?  Was it simply that it takes forever
to build all the tests?

Zach


Re: [review] [text] Text formal review

Boost - Dev mailing list
On 12.06.20 21:56, Zach Laine via Boost wrote:

>> (And no,
>> unencoded_rope would not be a better choice.  I can't memmap ropes, but
>> I can memmap string_views.)
>
> You can easily memmap string_views 2GB at a time, though.  That is a
> simple workaround for this corner case you have mentioned, but it is
> truly a corner case compared to the much more common case of using
> strings for holding contiguous sequences of char.  Contiguous
> sequences of char really should not be anywhere near 2GB, for
> efficiency reasons.

A memmapped string_view /is/ a contiguous sequence of char.  I don't see
the difference.

>> I like the use of negative indices for indexing from the end in
>> Python, but I am ambivalent about using the same feature in C++.  None
>> of the other types I regularly use in C++ work like that, and the
>> runtime cost involved is a lot more noticeable in C++.
>
> I don't really get what you mean about the runtime cost.  Could you be
> more explicit?

Somewhere in the implementation of operator[] and operator(), there has
to be a branch on index < 0 (or >= 0) in order for that negative index
trick to work, which the compiler can't always optimize away.  Branches
are often affordable but they're not free.
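
For the record, something like this is all I mean -- the sign check is
the per-access cost in question (sketch only):

    #include <cstddef>
    #include <string_view>

    // Python-style indexing with a signed index: the i < 0 test runs on
    // every access, whether or not negative indices are ever used.
    inline char at(std::string_view s, int i)
    {
        if (i < 0)
            i += static_cast<int>(s.size());
        return s[static_cast<std::size_t>(i)];
    }
    // at("hello", 1) == 'e'; at("hello", -1) == 'o'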

>> Also, using the
>> same type of counting-from-end indices and counting-from-beginning
>> indices seems unsafe.  A separate type for counting-from-end would be
>> safer and faster, at the cost of being more syntactically heavy.
>
> Interesting.  Do you mean something like a strong typedef for "index"
> and "negative-index", or something else?

Yes, something like that.
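
For example (just a sketch of the idea, not a worked-out interface):

    #include <cstddef>
    #include <string_view>

    // A separate type for counting from the end: the intent is visible at
    // the call site and no runtime sign check is needed.
    struct from_end { std::size_t offset; };

    inline char get(std::string_view s, std::size_t i) { return s[i]; }
    inline char get(std::string_view s, from_end i)    { return s[s.size() - i.offset]; }

    // get(sv, 2) is the third char; get(sv, from_end{1}) is the last char.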

>> One things that appears to be missing is a normalization-preserving
>> append/insert/erase operators.  These are available in the text layer,
>> but that means being tied to the specific text classes provided by that
>> layer.
>
> Hm.  I had not considered making these available as algorithms, and I
> generally like the approach.  But could you be more specific?  In
> particular, do you mean that insert() would take a container C and do
> C.insert(), then renormalize?  This is the approach used by C++ 20's
> erase() and erase_if() free functions.  Or did you mean something
> else?

I hadn't thought through the interface in detail.  I just saw that this
was a feature of the text layer, and thought it would be nice to have in
the unicode layer, because I don't want to use the text layer (in its
current form).

>> Requiring unicode text to be in Stream-Safe Format is another time bomb
>> waiting to go off, but it's also usability issue.  The library should
>> provide an algorithm to put unicode text in Stream-Safe Format, and
>> should automatically apply that algorithm whenever text is normalized.
>> This would make it safe to use Boost.Text on data from an untrusted
>> source so long as the data is normalized first, which you have to do
>> with untrusted data anyway.
>
> This seems like a good idea to me.  I went back and forth over whether
> or not to supply the SSF algorithm, since it's not an official Unicode
> algorithm, but adding it to text's normalization step would be reason
> enough to do so.
>
>> Text layer: overall, I don't like it.
>>
>> On one hand, there is the gratuitous restriction to FCC.  Why can't
>> other normalization forms be supported, given that the unicode layer
>> supports them?
>
> Here is the philosophy:  If you have a template parameter for
> UTF-encoding and/or normalization form, you have an interoperability
> problem in your code.  Multiple text<...>'s may exist for which there
> is no convenient or efficient interop story.  If you instead convert
> all of your text to one UTF+normalization that you use throughout your
> code, you can do all your work in that one scheme and transcode and/or
> renormalize at the program input/output boundaries.

Having to renormalize at API boundaries can be prohibitively expensive.

> Because, again, I want there to be trivial interop.  Having
> text<text::string> and text<std::string> serves what purpose exactly?
> That is, I have never seen a compelling use case for needing both at
> the same time.  I'm open to persuasion, of course.

The advantage of text<std::string> is API interop with functions that
accept std::string arguments.  I'm not sure what the advantage of
text<boost::text::string> is.  But if we accept that boost::text::rope
(which would just be text<boost::text::unencoded_rope> in my scheme)
is useful, then it logically follows that
text<some_other_string_implementation> could also be useful.

> The SSF assumption is explicitly allowed in the Unicode standard, and
> it's less onerous than not checking array-bounds access in operator[]
> in one's array-like types.  Buffer overflows are really common, and
> SSF violations are not.  That being said, I can add the
> SSF-conformance algorithm as mentioned above.

Unintentional SSF violations are rare.  Intentional SSF violations can be
used as an attack vector, if "undefined behavior" translates to "memory
error".

>> .../the_unicode_layer/searching.html: the note at the end of the
>> page is wrong, assuming you implemented the algorithms correctly.  The
>> concerns for searching NFD strings are similar to the concerns for
>> searching FCC strings.
>>
>> In both FCC and NFD:
>>     - There is a distinction between A+grave+acute and A+acute+grave,
>> because they are not canonically equivalent.
>>     - A+grave is a partial grapheme match for A+grave+acute.
>>     - A+acute is not a partial grapheme match for A+grave+acute.
>>     - A+grave is not a partial grapheme match for A+acute+grave.
>>     - A+acute is a partial grapheme match for A+acute+grave.
>> But:
>>     - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
>
> Hm.  That note was added because the specific case mentioned fails to
> work for NFD, but works for FCC and NFC.  I think the grapheme
> matching analysis above addresses something different from what the
> note is talking about -- the note is concerned with code-point-level
> searching results producing (perhaps) surprising results.  The
> grapheme-based search does not, so which CPs are matched by a
> particular grapheme does not seem to be relevant.  Perhaps I'm missing
> something?

I am talking about code point level matches here.  ("Partial grapheme
match" means "matches some code points within a grapheme, not not the
whole grapheme".)

>> I was not able to get the library to build, so I was not able to test
>> it.  But it does look like it should be a good ICU replacement for my
>> purposes, assuming it doesn't have any serious bugs.
>
> Huh.  What was the build problem?  Was it simply that it takes forever
> to build all the tests?

The configuration step failed because it tries to compile and run test
programs in order to gather information about my environment, and I was
running in a cross-compile context which prevents CMake from running the
programs that it compiles.  Probably not too hard to work around on my
part by simply not using a cross-compile context.


--
Rainer Deyke ([hidden email])



Re: [review] [text] Text formal review

Boost - Dev mailing list
Why does Boost.Text contain trie types in their own boost::trie
namespace?


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 6:40 AM Niall Douglas via Boost <
[hidden email]> wrote:
> Said same built-in string object would also
> be available to C++, by definition.

C++ already has such an object: it's called std::string, and the implementation
can do what it wants with it. To use it in C, all you need to do is compile
your C program with a C++ compiler.


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 10:36 AM Rainer Deyke via Boost <
[hidden email]> wrote:
> String layer: overall, a solution looking for a problem.

I tend to agree with this but I had assumed there is a reason why it
exists. Zach indicated that he doesn't care much about it, so it probably
should be deleted, but I don't feel strongly about this because there are
plenty of C++ APIs which define their own string type, and I don't see this
as a problem.

In C++, we have many of everything.


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 10:36 AM Rainer Deyke via Boost
<[hidden email]> wrote:
> String layer: overall, a solution looking for a problem.

I created boost::json::string to use instead of std::string for a
couple of reasons:

1. It uses a type-erased, reference counted memory resource instead of
Allocator template parameter
2. The class is laid out a certain way to greatly minimize its size
3. The class layout is optimized to keep the size of the enclosing
variant (the JSON DOM value type) small
4. The API is the same for all C++ versions (unlike std::string)
5. It is not a class template

I would like to know the exact _technical_ benefits of
Boost.Text's String layer, beyond "because I don't like std::string".

Thanks


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Sat, Jun 13, 2020 at 9:51 AM Bjorn Reese via Boost
<[hidden email]> wrote:
>
> Why does Boost.Text contains trie types in their own boost::trie
> namespace?

That naming is a mistake - an artifact of deciding where trie should
fit (e.g. separate library or not). If accepted, a Boost.Text library
will only add identifiers (namespaces included) to the boost::text::
namespace.

Glen


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost
<[hidden email]> wrote:

>
> On 12.06.20 21:56, Zach Laine via Boost wrote:
> >> (And no,
> >> unencoded_rope would not be a better choice.  I can't memmap ropes, but
> >> I can memmap string_views.)
> >
> > You can easily memmap string_views 2GB at a time, though.  That is a
> > simple workaround for this corner case you have mentioned, but it is
> > truly a corner case compared to the much more common case of using
> > strings for holding contiguous sequences of char.  Contiguous
> > sequences of char really should not be anywhere near 2GB, for
> > efficiency reasons.
>
> A memmapped string_view /is/ a contiguous sequence of char.  I don't see
> the difference.

The difference is mutability.  There's no perf concern with erasing
the first element of a string_view, if that's not even a supported
operation.

> > I don't really get what you mean about the runtime cost.  Could you be
> > more explicit?
>
> Somewhere in the implementation of operator[] and operator(), there has
> to be a branch on index < 0 (or >= 0) in order for that negative index
> trick to work, which the compiler can't always optimize away.  Branches
> are often affordable but they're not free.

Ah, I see, thanks.  Would it make you feel better if negative indexing
were only used when getting substrings?

> >> One things that appears to be missing is a normalization-preserving
> >> append/insert/erase operators.  These are available in the text layer,
> >> but that means being tied to the specific text classes provided by that
> >> layer.
> >
> > Hm.  I had not considered making these available as algorithms, and I
> > generally like the approach.  But could you be more specific?  In
> > particular, do you mean that insert() would take a container C and do
> > C.insert(), then renormalize?  This is the approach used by C++ 20's
> > erase() and erase_if() free functions.  Or did you mean something
> > else?
>
> I hadn't thought through the interface in detail.  I just saw that this
> was a feature of the text layer, and thought it would be nice to have in
> the unicode layer, because I don't want to use the text layer (in its
> current form).

I don't need a detailed interface.  Pseudocode would be fine too.

> >> Text layer: overall, I don't like it.
> >>
> >> On one hand, there is the gratuitous restriction to FCC.  Why can't
> >> other normalization forms be supported, given that the unicode layer
> >> supports them?
> >
> > Here is the philosophy:  If you have a template parameter for
> > UTF-encoding and/or normalization form, you have an interoperability
> > problem in your code.  Multiple text<...>'s may exist for which there
> > is no convenient or efficient interop story.  If you instead convert
> > all of your text to one UTF+normalization that you use throughout your
> > code, you can do all your work in that one scheme and transcode and/or
> > renormalize at the program input/output boundaries.
>
> Having to renormalize at API boundaries can be prohibitively expensive.

Sure.  Anything can be prohibitively expensive in some context.  If
that's the case in a particular program, I think it is likely to be
unacceptable to use text::operator+(string_view) as well, since that
also does on-the-fly normalization.  Someone, somewhere, has to pay
that cost if you want to use two chunks of text in
encoding/normalization A and B.  You might be able to keep working in
A for some text and keep working in B separately for other text, but I
think code that works like that is going to be hard to reason about,
and will be as common as code that freely mixes wstring and string
(and I mean not only at program boundaries).  That is, not very
common.

However, that is a minority of cases.  The majority case is that texts
have to be able to interop within your program arbitrarily, and so you
need to pay the conversion cost somewhere eventually anyway.

FWIW, I'm planning to write standardization papers for the Unicode
layer stuff for C++23, and the text stuff in the C++26 timeframe.  My
hope is that we will adopt my text design here into Boost in plenty of
time to see whether it is actually as workable as I claim.  I'm open
to the idea of being wrong about its design and changing it to a
template if a nontemplate design turns out to be problematic.

> > Because, again, I want there to be trivial interop.  Having
> > text<text::string> and text<std::string> serves what purpose exactly?
> > That is, I have never seen a compelling use case for needing both at
> > the same time.  I'm open to persuasion, of course.
>
> The advantage of text<std::string> is API interop with functions that
> accept std::string arguments.

Sure.  That exists now, though it does require a copy.  It could also
be done via a move if I replace text::string with std::string within
text::text, which I expect to as a result of this review.

> I'm not sure what the advantage of
> text<boost::text::string> is.  But if we accept that boost::text::rope
> (which is would just be text<boost::text::unencoded_rope> in my scheme)

That does not work.  Strings and ropes have different APIs.

> is useful, then it logically follows that
> text<some_other_string_implementation> could also be useful.

That's what I don't get.  Could you explain how text<A> and text<B>
are useful in a specific case?  "Could also be useful" is not
sufficient motivation to me.  I understand the impulse, but I think
that veers into over-generality in a way that I have found to be
problematic over and over in my career.

> >> .../the_unicode_layer/searching.html: the note at the end of the
> >> page is wrong, assuming you implemented the algorithms correctly.  The
> >> concerns for searching NFD strings are similar to the concerns for
> >> searching FCC strings.
> >>
> >> In both FCC and NFD:
> >>     - There is a distinction between A+grave+acute and A+acute+grave,
> >> because they are not canonically equivalent.
> >>     - A+grave is a partial grapheme match for A+grave+acute.
> >>     - A+acute is not a partial grapheme match for A+grave+acute.
> >>     - A+grave is not a partial grapheme match for A+acute+grave.
> >>     - A+acute is a partial grapheme match for A+acute+grave.
> >> But:
> >>     - A is a partial grapheme match for A+grave+acute in NFD, but not in FCC.
> >
> > Hm.  That note was added because the specific case mentioned fails to
> > work for NFD, but works for FCC and NFC.  I think the grapheme
> > matching analysis above addresses something different from what the
> > note is talking about -- the note is concerned with code-point-level
> > searching results producing (perhaps) surprising results.  The
> > grapheme-based search does not, so which CPs are matched by a
> > particular grapheme does not seem to be relevant.  Perhaps I'm missing
> > something?
>
> I am talking about code point level matches here.  ("Partial grapheme
> match" means "matches some code points within a grapheme, not not the
> whole grapheme".)

Ah, I see.  Thanks.  I'll update the note.

Zach


Re: [review] [text] Text formal review

Boost - Dev mailing list
On 14.06.20 01:25, Zach Laine via Boost wrote:
> On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost
> <[hidden email]> wrote:
>> A memmapped string_view /is/ a contiguous sequence of char.  I don't see
>> the difference.
>
> The difference is mutability.  There's no perf concern with erasing
> the first element of a string_view, if that's not even a supported
> operation.

A /lot/ of strings, probably the vast majority, will never be mutated.
And for the rest, the majority will only be mutated by appending.

Erasing the first element is a nice to have but expensive and rarely
used feature.  If you find yourself doing that a lot, then you probably
do want a rope.

>> Somewhere in the implementation of operator[] and operator(), there has
>> to be a branch on index < 0 (or >= 0) in order for that negative index
>> trick to work, which the compiler can't always optimize away.  Branches
>> are often affordable but they're not free.
>
> Ah, I see, thanks.  Would it make you feel better if negative indexing
>   were only used when getting substrings?

That does address the performance problem, so yes.

>> I hadn't thought through the interface in detail.  I just saw that this
>> was a feature of the text layer, and thought it would be nice to have in
>> the unicode layer, because I don't want to use the text layer (in its
>> current form).
>
> I don't need a detailed interface.  Pseudocode would be fine too.

insert_nfd(string, position, thing_to_insert)
// Insert 'thing_to_insert' into 'string' at 'position'.  Both 'string'
// and 'thing_to_insert' are required to be in NFD.  The area around the
// insertion is renormalized to NFD.
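
Spelled out a little more (sketch only; it renormalizes the whole string
for brevity, where a real implementation would only touch the region
around the insertion, and it assumes the normalize_to_nfd() algorithm
takes a UTF-8 char range, as in your suggestion below):

    #include <cstddef>
    #include <iterator>
    #include <string>
    #include <string_view>
    #include <utility>

    // Both 'str' and 'to_insert' are assumed to already be NFD, UTF-8.
    void insert_nfd(std::string & str, std::size_t pos, std::string_view to_insert)
    {
        str.insert(pos, to_insert.data(), to_insert.size());
        std::string renormalized;
        text::normalize_to_nfd(str, std::back_inserter(renormalized));
        str = std::move(renormalized);
    }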


>> Having to renormalize at API boundaries can be prohibitively expensive.
>
> Sure.  Anything can be prohibitively expensive in some context.  If
> that's the case in a particular program, I think it is likely to be
> unacceptable to use text::operator+(string_view) as well, since that
> also does on-the-fly normalization.

Hopefully only on the string_view and the area immediately surrounding
the insertion.

> Someone, somewhere, has to pay
> that cost if you want to use two chunks of text in
> encoding/normalization A and B.  You might be able to keep working in
> A for some text and keep working in B separately for other text, but I
> think code that works like that is going to be hard to reason about,
> and will be as common as code that freely mixes wstring and string
> (and I mean not only at program boundaries).  That is, not very
> common.

Which is why I want to avoid just that.

Your suggestions:

   void f() {
     // renormalizes to fcc
     text::text t = api_function_that_returns_nfd();
     do_something_with(t);
     string s;
     text::normalize_to_nfd(t.extract(), back_inserter(s));
     api_function_that_accepts_nfd(s);
   }

My suggestion:

   void f() {
     text::text<nfd, std::string> t = api_function_that_returns_nfd();
     do_something_with(t);
     api_function_that_accepts_nfd(t.extract());
   }

> That's what I don't get.  Could you explain how text<A> and text<B>
> are useful in a specific case?

text<deque<char> >, for fast insertion/removal at both ends?

But it's really text::text<std::string> that I'm after, so if
text::text<std::string> becomes just text::text, then I'm satisfied.


--
Rainer Deyke ([hidden email])



Re: [review] [text] Text formal review

Boost - Dev mailing list
On Sun, Jun 14, 2020 at 7:25 AM Rainer Deyke via Boost
<[hidden email]> wrote:

>
> On 14.06.20 01:25, Zach Laine via Boost wrote:
> > On Fri, Jun 12, 2020 at 4:15 PM Rainer Deyke via Boost
> > <[hidden email]> wrote:
> >> A memmapped string_view /is/ a contiguous sequence of char.  I don't see
> >> the difference.
> >
> > The difference is mutability.  There's no perf concern with erasing
> > the first element of a string_view, if that's not even a supported
> > operation.
>
> A /lot/ of strings, probably the vast majority, will never be mutated.

Ok, then those should more appropriately be string_views.

> And for the rest, the majority will only be mutated by appending.

That does not help, unless the capacity is so large that a
reallocation is unnecessary.

> Erasing the first element is a nice to have but expensive and rarely
> used feature.  If you find yourself doing that a lot, then you probably
> do want a rope.

Any mutation might cause a reallocation.  I named one of the
worst-case operations rhetorically, but appending is also bad if it
causes that reallocation.  It's not a question of what kind of
mutating operation you're doing, but whether you're mutating or not.

> >> I hadn't thought through the interface in detail.  I just saw that this
> >> was a feature of the text layer, and thought it would be nice to have in
> >> the unicode layer, because I don't want to use the text layer (in its
> >> current form).
> >
> > I don't need a detailed interface.  Pseudocode would be fine too.
>
> insert_nfd(string, position, thing_to_insert)
> // Insert 'thing_to_insert' into 'string' at 'position'.  Both 'string'
> // and 'thing_to_insert' are required to be in NFD.  The area around the
> // insertion is renormalized to NFD.

I see -- no surprises here.  As I said, I like this idea a lot!
However, see below.

> >> Having to renormalize at API boundaries can be prohibitively expensive.
> >
> > Sure.  Anything can be prohibitively expensive in some context.  If
> > that's the case in a particular program, I think it is likely to be
> > unacceptable to use text::operator+(string_view) as well, since that
> > also does on-the-fly normalization.
>
> Hopefully only on the string_view and the area immediately surrounding
> the insertion.

No, that's why I picked string_view, and not text_view.  text_view
insertion does not normalize the incoming text, but string_view
insertion does.  This is in keeping with the philosophy:

- At program I/O boundaries (not all API boundaries), convert to UTF-8 and FCC.
- Internal interfaces that take UTF-8/FCC will not transcode or normalize.
- Internal interface that take non-UTF-8/FCC will transcode and
normalize as needed.

text::operator+(string_view sv) does not know the normalization of sv,
so it normalizes.  The alternative is clunky -- you have to make a new
string somewhere to normalize into, and then use operator+() on the
result.

> > Someone, somewhere, has to pay
> > that cost if you want to use two chunks of text in
> > encoding/normalization A and B.  You might be able to keep working in
> > A for some text and keep working in B separately for other text, but I
> > think code that works like that is going to be hard to reason about,
> > and will be as common as code that freely mixes wstring and string
> > (and I mean not only at program boundaries).  That is, not very
> > common.
>
> Which is why I want to avoid just that.
>
> Your suggestions:
>
>    void f() {
>      // renormalizes to fcc
>      text::text t = api_funtion_that_returns_nfd();
>      do_something_with(t);
>      string s;
>      text::normalize_to_nfd(t.extract(), back_inserter(s));
>      api_function_that_accepts_nfd(s);
>    }
>
> My suggestion:
>
>    void f() {
>      text::text<nfd, std::string> t = api_function_that_returns_nfd();
>      do_something_with(t);
>      api_function_that_accepts_nfd(t.extract());
>    }

Right, I get it.  I just think you're leaving out the lack of
interoperability with text::text<nfc, std::wstring>, etc.  That's not
a trivial concern.

If you have code that needs to stay NFC as in your example, you should
be able to use std::string and insert_nfc() and friends.  This is yet
another case where a perf tradeoff forces you to write a bit more
code.  That does not seem onerous to me.

Zach


Re: [json] [text] Multiple, custom string types

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Mon, Jun 15, 2020 at 4:18 AM Andrzej Krzemienski <[hidden email]> wrote:
> can they use their string in Boost.JSON?

No. The design of the library explicitly avoids allowing customizability.

> It should be possible to parametrize Boost.JSON with the string type.

This has been explored in other libraries, and I don't like the
result. Here is nlohmann JSON's value type:

template<
        template<typename, typename, typename...> class ObjectType,
        template<typename, typename...> class ArrayType,
        class StringType,
        class BooleanType,
        class NumberIntegerType,
        class NumberUnsignedType,
        class NumberFloatType,
        template<typename> class AllocatorType,
        template<typename, typename = void> class JSONSerializer
        >
    class basic_json
    {
      private:
        ....
        friend ::nlohmann::detail::parser<basic_json>;
        friend ::nlohmann::detail::serializer<basic_json>;
    ...

<http://vinniefalco.github.io/doc/json/json/comparison.html#json.comparison.comparison_to_nlohmann_json>

Thanks, but no thanks.

As I said, the library achieves its performance partly due to the
layout of the different types used in the discriminated union, of
which string is one. Users aren't asking for "using their own string
type in the JSON DOM." What they want is the speed of RapidJSON,
options for skipping comments, allowing trailing commas, and flexible
parsing, and the interface quality of a Boost library.

Thanks


Re: [json] [text] Multiple, custom string types

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Mon, Jun 15, 2020 at 4:18 AM Andrzej Krzemienski <[hidden email]> wrote:
> It should be possible to parametrize Boost.JSON with the string type.

Another problem with doing this, is that it breaks the invariant of
Boost.JSON's DOM value type:

* All child elements in a DOM will inherit the allocator of the root element.

If we allow user-defined strings, first they probably won't work with
storage_ptr (Boost.JSON's custom allocator container interface), and
second they can break the invariant.
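
A small illustration of that invariant as it stands in the current
Boost.JSON sources (treat the exact spellings as approximate, since
names have shifted during development):

    #include <boost/json.hpp>
    #include <cassert>

    int main()
    {
        boost::json::monotonic_resource mr;
        boost::json::value jv( &mr );               // the root owns the storage_ptr
        jv = boost::json::object();                 // assignment keeps the root's storage
        jv.as_object()["name"] = "a string child";  // the child is built with that storage

        // Children inherit the allocator of the root element:
        assert( jv.storage().get() == &mr );
        assert( jv.as_object()["name"].storage().get() == &mr );
    }

A user-supplied string type carrying its own allocator would sit outside
this model, which is the invariant-breaking problem described above.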

My goal is for boost::json::value to be suited as a vocabulary type.
Allowing customization works against that. An example of an existing
useless customization point would be char traits in basic_string
and/or basic_string_view. I don't want to create more.

Thanks


[review] [text] Text formal review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
This is a review for the Boost.Text library, submitted a day late (but
hopefully not a dollar short! (U.S. colloquialism, don't mind me!)).

The library has 3 somewhat related but (somewhat?) separable
sub-libraries. In "building block" order, these are:

- A string layer (a new std::string)
- A unicode layer (algorithms and data)
- A text layer (string, but if it gave a single flying crap about Unicode)

There are 4 (3?) types to care about in the string layer:
unencoded_rope, string, segmented_vector (and string_builder...?).
There are not many new types in the unicode layer, save for things that
help the algorithms/data do their work and report findings from
algorithm calls.
There are 2 types to care about in the text layer: text, and rope.

We will start with the lowest building block layer. This will likely
not be a typical review: most other reviewers have already called out
the things that fall short in the documentation and elsewhere, so I will
focus primarily on the utility and design of the layers and what they
can bring to the table.

======
Layer 0
======

[[string]]
It gets rid of char_traits (yay!) but then also throws out the
allocator (ewwww!). ... That's it.

This type does not affect my view of the library because it can be
(mostly) safely ignored. It can be nuked from orbit and nothing of
value will be lost. I would actually recommend std::string be used
underneath, because why do this to the ecosystem for the (N+1)th time?

[[unencoded_rope]][[segmented_vector]]
These two data structures are FAR more spicy and incredibly
interesting. They provide different guarantees of insertion and
erasure complexity. Of course, neither has allocators built in, so I
can't really customize how this works without hijacking global new and
delete, but Zach has made clear his distaste for the allocator world
and, having recently built several standard containers, I don't blame
him.

Nevertheless, both of these data structures are being talked about
together because they provide the same type of functionality:
unencoded_rope is just specialized for char storage. Notably,
segmented_vector has an insert for (iterator, value_type) while
unencoded_rope seemed to be missing that and only wanted to deal with
"strings"/ranges, rather than single elements. This made me applying
my fun text-wrapper on unencoded_rope mildly annoying because
single-insert was just not present:

phd::text::basic_text<utf8, nfd, boost::text::unencoded_rope> wee;
wee.insert(u8'A'); // kabloosh!

Nevertheless, my "shortcuts" for single insertion are honestly a waste
of space because I can just turn that into a range of size 1 and use
less of the "required" SequenceContainer
(https://en.cppreference.com/w/cpp/named_req/SequenceContainer) bits
'n' bobs anyway. So no real harm, no actual foul!

Running my tests with an encoding slapped on top of the unencoded rope
or the segmented vector worked, which meant I could get a different
storage policy with the nfd normalization form and the encoding of my
choice. (Well, I only tested utf8/16/32, one byte encoding, and then
the current execution character set (which was just utf8 anyways so
that's not really that exhaustive, is it?)).

This layer has immense value. Keep it and ship it; great job, Zach!

[[string_builder]]
I think this is vestigial. So, uh, doesn't really affect the review,
and I don't care for it?


=======
Layer 1
=======

Yes.

... That's it. That's literally it: ship it. Goddamn, ship this layer
like your favorite movie couple. This is what we need. This is what we
crave. It's ICU, except if ICU went to study under Stepanov, Lee and
Plauger instead of Gosling, Sheridan and Naughton. No complaints, no
problems: having this layer makes this library ABSOLUTELY worth it,
210%. There are even special normalize-in-place algorithms for
strings, which can save on performance. You can implement your own
Unicode text-aware layer on top of this stuff, it provides a robust
set of algorithms and normalization forms (hell yeah!) and makes every
second that this library is not in Boost a tragedy.
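
As a taste of what "implement your own" looks like, a minimal sketch
that keeps its own storage and leans on the layer-1 algorithms. The
normalize_to_nfd(range, out) shape is borrowed from elsewhere in this
thread; the header name and exact range requirements are assumptions:

    #include <boost/text/normalize.hpp>   // header name is an assumption
    #include <iterator>
    #include <string>

    struct my_text
    {
        std::string storage;   // whatever container/encoding you actually want

        void append(std::string const & utf8)
        {
            std::string normalized;
            boost::text::normalize_to_nfd(utf8, std::back_inserter(normalized));
            storage += normalized;
        }
    };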

Passed the necessary tests on my machine despite taking an age, but
that's more because the generated Unicode test suite is a doozy.

Speaking of "implement your own Unicode text-aware layer..."


======
Layer 2
======

This is the layer I am -- on a library design level and a personal
philosophy level -- the most opposed to.

But my answer is still to accept it (well, modulo it being based on
the above string type. Please just use std::string).

[[ text ]] [[ rope ]]
While these containers can be evaluated individually, other reviews
have picked up a great deal of pickings at them and so I won't bother.
There was some grumbling about how a rope-like data structure is not
interesting enough to be included and I will just quietly wave that
off as "my use case is the only use case that matters and therefore I
don't care about other people's invariants or needs".

There are many implicitly (and explicitly) stated and maintained
opinions in this layer:

- UTF-8 is the way, truth, and life.
- Unicode is the only encoding that matters ever, for all time, in perpetuity.
- Allocators are shit!
- NFC is probably the best we can do here for varying reasons.
- Who needs non-contiguous storage anyways?
- Who needs non-SBO storage, anyways?

These are all opinions, many of which are present in the design of the
text container. And they allow this text container to ship. But that
lack of flexibility -- while okay for Qt or Apple's CoreText or
whatever other platform-specific hoo-ha you want to get involved with
-- does not help. In fact, it cuts them off: more than one person
during Meeting C++ spoke to me of Boost.Text and said it could not
meet their needs because it maintained encoding or normalization
invariants that did not interoperate with their existing system.
Storage is also an issue: while "I use boost::text::string underneath"
is fine and dandy, not many systems (next to none, maybe?) are going to
speak in "text" or its related string type. They will want the
underlying container to speak to. For duck-typing purposes, it works.
But for everyone else, it fails.

Since the string layer uses an `int` for its size and capacity, it is
lopsidedly incompatible with existing STL implementations of string,
to the point that a reinterpret_cast -- however evil -- is not
suitable for transporting a reference-without-copy into these APIs.
God bless string_view and its friends, because they allow us to at
least continue to talk to some APIs, since the text type guarantees
contiguous storage. This means that at the boundaries of an
application -- or even as a plugin to a wider ecosystem -- I am paying
a (sometimes obscene) cost to interoperate between
std::string/llvm::SmallString/unicode_code_unit_sequence and all the
other things people have developed to sit between them and what they
believe their string needs are. And while it is whack that so many of
these classes exist,

they do.

That lack of interoperability -- and once again, the lack of an
allocator template parameter -- keeps this library from COMPLETELY
DOMINATING the string scene. It will always be used as a solution,
maybe even 80% of the time. Those seeking more will have to figure out
how to build their own UTF16 containers, or their own special-encoded
containers, with very little support from the text library (save for
some transcoding functions they can leverage, but only from specific
Unicode encodings).

Onto the good news: the text and rope classes work like I expect them
to. They pass my tests. A+, great job: they keep my text in utf8 and
the prescribed normalization form! Despite the length of my previous
critique that basically amounts to "who died and made you King of my
string layout and memory allocation?", this layer and the library
should still be accepted.

============
Okay, Seriously?
=============

Yep.

See, the problem right now with C++ -- and the standard in general --
is that we like to wait for something to bake for an eternity, often
long after it's useful and necessary for the end user. In C++11 we
introduced a "codecvt"-style thing called "std::wstring_convert",
whose sole purpose was transcoding, plus or minus some platform
shenanigans. It was implemented poorly on almost all platforms, its
performance was hot garbage
(https://github.com/ThePhD/sol2/issues/571), and it was generally a
bug-ridden mess.

But it shipped.

What we did when we both deprecated and removed std::wstring_convert
and its related facets is we took a real pain point in the C++
community and decided to make it far worse than it already was. See,
C++ -- and C++11 -- were steaming piles of dogpoo when it came to
Unicode (https://stackoverflow.com/a/17106065). So when
wstring_convert came on the scene, it was a breath of fresh air. Yeah,
the performance is garbage, yes the interface is trash, yes it hasn't
learned anything from Stepanov's fantastic work, but it was there. It
was workable. And it was standard.

And the Committee ripped it out of the user's hands.

Boost.Text, for however many extremely opinionated decisions it makes
that end up excluding certain parts of the C++ ecosystem, provides
SORELY needed relief for the majority of the C++ community, who have
been struggling for the tiniest bit of a text solution. So even if the
storage has a mandated encoding; a strict normalization form is given;
and, everything else costs you a pound of flesh to build yourself, the
whole point is that there is a default, and it is a pretty good
default.

This is something that cannot be overstated in the slightest; we have
nothing -- and I mean, N O T H I N G -- that reflects a good C++
library for Unicode. Even if you do not like Zach's decisions, other
people can pick up Zach's container types and run with them for quite
a while. Sure, the 7x performance gains I got in my last job using
solely allocators are impossible with Boost Text! But Layer 1 exists:
I can leverage well-done Unicode algorithms to do the job I need to,
even if it is not as convenient and pre-packaged as I would like it to
be. This is not only important for the ecosystem at large, but for the
Boost Community. For a long time people have wondered if Boost will
lead the charge towards a better, brighter future by solving problems
that users face the most, or if it would fade into
compatibility-library obscurity and be repeatedly reviled for its
special build needs and required setup over its standard library
equivalents.

Boost.Text is one of many libraries I *expect* to see land in Boost to
solve critical problems, to be iterated and shipped towards the wider
C++ ecosystem and have an impact that most library developers would
only dream of.


==========
In Conclusion
==========

Just one really big thing for me:

- Use std::string underneath. `int` is not a good size type. People
work with strings larger than 1 GB (INT_MAX / 2, as reported by the
string implementation).

Other people commented on the other fixes I would care about and most
of those have already been noted, thanks! Other than that...

Please accept Boost.Text for inclusion in the next available version
of Boost and continue to work towards the end of our collective 40
year string nightmare.

We can sort out COMPLETE DOMINATION of the design space a little
later, since this design is -- thankfully -- one that remains open to
source-backwards-compatible improvements.


Re: [review] [text] Text formal review

Boost - Dev mailing list
On Sun, Jun 21, 2020 at 7:16 AM JeanHeyd Meneide via Boost
<[hidden email]> wrote:
> ...neither have allocators built in so I
> can't really customize how this works without hijacking global new and
> delete, but Zach has made clear his distaste for the allocator world
> ...
> - Allocators are shit!
> ...
> "who died and made you King of my string...memory allocation?
> ...

The inability to control how and where memory is allocated is a
significant obstacle to using ANY container in a concurrent network
server. In every case, the first step of improving the performance of
such a program is to reduce the number of calls to allocate memory.
Often, simply reducing memory allocations is sufficient to give the
program the desired performance.

A typical asynchronous network program follows a predictable cycle of
operations in a thread:

1. Data is received from the remote peer
2. Memory is allocated to perform the operation
3. The operation is performed
4. Allocated memory is freed
5. Control of the thread is returned to the I/O subsystem

It is in steps 2 and 4 where a custom allocator is used to optimize
the I/O cycle. The techniques are simple:

1. Reuse a previously allocated block of memory
2. Utilize the stack for storage (preferred when the storage
requirements for the operation are bounded)

Without the ability to control the allocator, these optimizations are
not possible with the containers in Boost.Text.
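
For the avoidance of doubt, here is the kind of thing being asked for,
sketched with std::pmr (C++17) rather than anything in Boost.Text,
since Boost.Text cannot currently express it:

    #include <memory_resource>
    #include <string>
    #include <string_view>

    // Technique 2 above: per-operation storage comes from a stack buffer,
    // so the hot path never touches the global allocator.
    void handle_request(std::string_view payload)
    {
        char buf[4096];
        std::pmr::monotonic_buffer_resource mr(buf, sizeof(buf));
        std::pmr::string scratch(&mr);   // all growth comes from buf (until exhausted)
        scratch.assign(payload.begin(), payload.end());
        // ... perform the operation on scratch ...
    }   // everything is released wholesale when mr goes out of scope (step 4)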

I have not looked deeply enough into Boost.Text, nor am I
knowledgeable enough about Unicode to write a competent review.
However, were I to write such a review I would REJECT the library for
the sole reason that it does not support allocators and thus cannot be
used optimally in a network program.

If the author decides to add allocator support, I suggest rather than
using an Allocator template parameter, to instead use
`boost::container::memory_resource&` as a constructor parameter (or
`std::pmr::memory_resource&` if available). This allows the container
to be implemented as an ordinary class instead of a class template.
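
Roughly, the shape being suggested (spelled with std::pmr here; the
boost::container equivalent works the same way -- this is an
illustration of the suggestion, not anything in Boost.Text):

    #include <memory_resource>
    #include <string>
    #include <string_view>

    // Ordinary class, no Allocator template parameter; the memory_resource
    // is chosen once, at construction.
    class text
    {
        std::pmr::string str_;
    public:
        explicit text(std::pmr::memory_resource & mr
                          = *std::pmr::get_default_resource())
            : str_(&mr) {}

        void append(std::string_view sv) { str_.append(sv.data(), sv.size()); }
        // ... the rest of the interface is unchanged; only where the bytes
        // come from is configurable ...
    };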

Thanks


Re: [SG16] [review] [text] Text formal review

Boost - Dev mailing list
On Sun, 21 Jun 2020 at 18:32, Vinnie Falco via SG16
<[hidden email]> wrote:
> If the author decides to add allocator support, I suggest rather than
> using an Allocator template parameter, to instead use
> `boost::container::memory_resource&` as a constructor parameter (or
> `std::pmr::memory_resource&` if available). This allows the container
> to be implemented as an ordinary class instead of a class template.

The unfortunate bit in doing that is that a memory_resource doesn't
provide hooks for deciding how it propagates, so you'll be nailing that
question down to one answer. It sure looks to me that there's more than
one answer to that question.

If you're using an allocator, make it a template parameter. Some
proponents of polymorphic allocators tell their audiences that with the
advent of polymorphic allocators, allocators as template parameters
become unnecessary. Based on my experience in this area, they are
mistaken.

Or perhaps in other words: a memory_resource is not an allocator. It's
a mechanism for acquiring and releasing memory, but that's not all an
allocator is; there's more to it than that.
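
Concretely, the "more to it" is things like the propagation traits,
which a container pinned to a memory_resource& has to answer once and
for all. A minimal allocator sketch showing the knobs involved:

    #include <cstddef>
    #include <new>
    #include <type_traits>

    // The propagate_on_* traits decide what happens to the allocator on
    // copy assignment, move assignment, and swap of the container --
    // exactly the per-type choices a fixed memory_resource& takes away.
    template <class T>
    struct propagating_alloc
    {
        using value_type = T;

        using propagate_on_container_copy_assignment = std::true_type;
        using propagate_on_container_move_assignment = std::true_type;
        using propagate_on_container_swap            = std::true_type;

        propagating_alloc() = default;
        template <class U>
        propagating_alloc(propagating_alloc<U> const &) noexcept {}

        T * allocate(std::size_t n)
        { return static_cast<T *>(::operator new(n * sizeof(T))); }

        void deallocate(T * p, std::size_t) noexcept { ::operator delete(p); }

        friend bool operator==(propagating_alloc, propagating_alloc) { return true; }
        friend bool operator!=(propagating_alloc, propagating_alloc) { return false; }
    };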
