Boost.Text review

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Boost.Text review

Boost - Dev mailing list
This is my review of Zach Laine's proposed Boost.Text.

Zach is offering several things here: Unicode, a "better string",
a trie, and more. I'm pleased that these components are all
individually exposed to users and are independent of each other,
rather than some being hidden as implementation details.

I've spent a couple of days looking at a few parts of the library.
I'll discuss at the Unicode stuff first.

Boost gained support for Unicode functionality when Boost.Locale
was accepted almost a decade ago, but I was not particularly
impressed by that library and have not used it in my own work.
Does the current proposal do better? I think it does, and
though it's not perfect I think it would be worth accepting but
subject to a concern about copyright.

See another thread for the copyright discussion. Basically,
if I understand correctly, Zach's code includes tables that are
derived from Unicode data and the licence for that data requires
a copyright attribution. The Boost licence does not require
attribution and I support that. So if I've understood the
situation correctly - and I am not a lawyer! - this means that
sadly Boost must REJECT the affected code.

I'll say some more about the other aspects of the Unicode code,
just in case the copyright issue turns out to be surmountable.

I've tried just a couple of things. Firstly, I tried the UTF-8
to UTF-32 conversion. This seems to work, and the API is
reasonable. Performance is not bad, but it could be better
(see another thread); that could be fixed later. I think it
would be worth having just a UTF library in Boost!

Secondly I looked at the line-splitting algorithm, i.e.

template<typename CPIter, typename Sentinel, typename Extent,
         typename CPExtentFunc>
  unspecified lines(CPIter first, Sentinel last, Extent max_extent,
                    CPExtentFunc cp_extent, bool break_overlong_lines = true);

Again this seems to work, and the API is reasonable. Reference
documentation is somewhat lacking though; in the description of
lines(), the requirements for CPExtentFunc need to be described,
and the "unspecified" return type needs to document that it includes
members including iter and hard_line_break. I think there are
probably similar omissions in much of the reference documentation.

I'm unconvinced by the use of the word "sentinel" here. A sentinel
is a value with a special meaning, e.g. a terminating null; what
you have here is an end-iterator. It's good that you're allowing
the begin and end iterators to have different types, but I think
you should name that type something like CPEndIter.

The implementation of line splitting seems to call its word
length callback more often than necessary, i.e. when a word
overflows a line, that word is subsequently measured a second
time. That should be easily fixable.

Moving on to other parts of the proposal, I'm less happy with
text::string. The first thing I noticed was the very strange
decision to use int, rather than (signed?) size_t, for indexing;
yes I do work with chunks of text larger than 2 GB! I would
REJECT text::string for this alone.

I also dislike the use of operator() to extract substrings; the
meaning just isn't obvious:

auto x = y(4);

Is that the 4th character of y? Or the substring from 4 to the
end? No, it's the substring from the start to 3. I'd need to
add a comment to explain what it's doing. But actually there's
another problem with that line: x is a non-owning view of that
substring. I think that methods that return a view (or span,
etc.) should have a name that makes that clear, e.g. "subview".
(This wouldn't have been a problem before auto.)  This came up in
the static_string review, where the substr() method was originally
proposed to return a view; that was renamed subview() with
substr() returning a copied string.

I don't mind that the string isn't templated on the char type,
but I do miss the allocator support; I've used strings with arena
allocators (for performance) and with shared memory allocators.
I know allocator support is a lot of work but it is necessary, IMO.

text::text seems to be the obvious combination of text::string
with unicode functionality, and I've not looked at that.

I have had a look at the trie; I have sometimes wondered to what
extent a trie can be implemented in C++ that works as much as
possible like a standard container so this is interesting.

The docs should mention briefly why trie is here, i.e. where it
is used, just in anticipation of that obvious question from users.

trie::operator[] returns an "optional ref". But it's not like
a std::optional; as well as operator* it has conversion operators
to access the value. Zach notes that this is "wonky" when the
value type is bool, but it also seems to break for me when it's
int:

boost::trie::trie_map<std::string, std::string> counts;
auto x = counts["Hello"];
if (x) std::cout << "present\n"; else std::cout << "absent\n";
// Prints absent, OK.

boost::trie::trie_map<std::string, int> int_counts;  // Now int.
auto y = int_counts["Hello"];
if (y) std::cout << "present\n"; else std::cout << "absent\n";

// ../text/include/boost/text/trie.hpp:78:
// boost::trie::optional_ref<T, Const>::operator T&() &
// [with T = int; bool Const = false]: Assertion `t_' failed.

operator T&() is being called by if(y), rather than
operator bool(). There aren't has_value() and value() methods
that I can use to work around this.

Like a few other things in this proposal, Zach has implemented
trie::operator[] how he wishes the standard library worked.
As another example, dereferencing a trie_map iterator doesn't
return a pair<KEY,VALUE> as a normal associative container
would, but rather a struct with members called key and value.

At worst, this will hit problems that the standard solutions
avoid (as above). At best, it will make the library more
difficult to learn for users who are familiar with the standard
library (including its problems), and make generic code less
likely to work with it.

As a test I wrote a simple program to count word frequencies:

using counts_t = std::map<std::string,int>;
counts_t counts;

while (std::cin.good()) {
  std::string s;
  std::cin >> s;
  counts[s]++;
}

for (auto i: counts) {
  std::cout << i.first << " = " << i.second << "\n";
}

It would be great if I could just change it to:

using counts_t = boost::trie_map<std::string,int>;

but I can't do that; I need to change things:

struct count { int value = 0; };  // Wrap the int.
boost::trie::trie_map<std::string, count> counts;

while (std::cin.good()) {
  std::string s;
  std::cin >> s;

  // operator[] doesn't insert, so:
  auto r = counts[s];
  if (r) r->value++;
  else counts.insert(s,count{1});
}

for (auto i: counts) {
  std::cout << i.key << " = " << i.value.value << "\n";
}

That works; how about performance? Trie should perform
well when the stored strings frequently share common prefixes,
so I've tested with a large web server log file; this has
timestamps and URLs which share common prefixes. The results:

std::map: run time 1.2 seconds, memory 1007 pages
trie_map: run time 1.8 seconds, memory 4868 pages

So that's rather disappointing.

Some general comments:

Some people care about whether libraries are header-only or not,
and the docs should mention this, and possibly say which
functionality can be expected to work header-only.

I got quite a few missing field initializer and unused parameter
warnings; please fix these if at all possible. (gcc 6.3, -W -Wall).
I think you should re-order the docs so that the "Hello World"
example is much earlier.

Some functions clearly need to embed large Unicode tables; it
would be useful if the docs mentioned roughly how much the
executable size is increased by.

Some source files are executable.

Conclusions:

I fear that the only part of this that Boost should accept is
the UTF transcoding. The implementation is not perfect but we
could work on that. The other Unicode stuff would be good to
have but the copyright issue is a blocker.



Regards, Phil.



_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost
<[hidden email]> wrote:
> if I understand correctly, Zach's code includes tables that are
> derived from Unicode data and the licence for that data requires
> a copyright attribution.

Could we ask the owner of the copyright to dual-license the data under
the BSL? That might be a cheap and easy solution.

Thanks

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Thu, 18 Jun 2020 at 13:22, Phil Endecott via Boost <[hidden email]>
wrote:

> I have had a look at the trie; I have sometimes wondered to what
> extent a trie can be implemented in C++ that works as much as
> possible like a standard container so this is interesting.
>

https://github.com/ned14/nedtries : no need to start from scratch, I have
seen the author's name being mentioned here and there, one in a while.
Explosive development without reflection, design and consolidation, amounts
to anarchy.

degski

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
I also am not a lawyer but know from experience that you can't copyright
data, or simple expressions of it. A program is an original work. A table
of facts it uses is not. Someone can write what they like in a licence but
if the table can be (ideally, has been) produced manually or
algorithmically from information (published in whatever form, even if
copyright) that simply conveys facts you can use it. There is some
risk/issue if you use the way the facts are arranged "creatively" e.g. as a
graph or figure or if included in a database (that is organised
"creatively" to allow use/retrieval) etc.. Also copyright does not protect
an underlying algorithm - only the program that implements it. And if there
is simply/only one obvious way to represent said algorithm etc it doesn't
prevent one writing the same thing... Obviously a whole program composed of
fragments that aren't individually copyrightable but "creatively arranged"
in some original program can't then be arranged into a "new" program to
form a whole that is - effectively - the original work without that copying
of creative arrangement being a breach of copyright ...

Also there may be a need to cite sources. and this is only polite -
reasonable anyway -  but that does NOT require things like inclusion of
copyright notices with distributed binaries etc. Citing the source in the
comments/ docs / readme other human readable content of the work (the
source code) is sufficient.

I have purposely not, so far, looked at the specifics of this (original
source of this table or usage in proposed boost.text). I might be being
naive in my assumption that this is a relatively simple expression of facts.
It might be necessary for the removal of all doubt to produce a boost table
and code that uses it from primary sources and descriptions (not programs)
re how the table/algorithm is used/works.

On Fri, 19 Jun 2020, 4:25 am Vinnie Falco via Boost, <[hidden email]>
wrote:

> On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost
> <[hidden email]> wrote:
> > if I understand correctly, Zach's code includes tables that are
> > derived from Unicode data and the licence for that data requires
> > a copyright attribution.
>
> Could we ask the owner of the copyright to dual-license the data under
> the BSL? That might be a cheap and easy solution.
>
> Thanks
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
El vie., 19 jun. 2020 a las 0:45, Darryl Green via Boost (<
[hidden email]>) escribió:

> I also am not a lawyer but know from experience that you can't copyright
> data, or simple expressions of it. A program is an original work. A table
> of facts it uses is not. Someone can write what they like in a licence but
> if the table can be (ideally, has been) produced manually or
> algorithmically from information (published in whatever form, even if
> copyright) that simply conveys facts you can use it. There is some
> risk/issue if you use the way the facts are arranged "creatively" e.g. as a
> graph or figure or if included in a database (that is organised
> "creatively" to allow use/retrieval) etc.. Also copyright does not protect
> an underlying algorithm - only the program that implements it. And if there
> is simply/only one obvious way to represent said algorithm etc it doesn't
> prevent one writing the same thing... Obviously a whole program composed of
> fragments that aren't individually copyrightable but "creatively arranged"
> in some original program can't then be arranged into a "new" program to
> form a whole that is - effectively - the original work without that copying
> of creative arrangement being a breach of copyright ...
>
> The problem is that what is copyright, what is copyrightable, and what is
legal or illegal is defined in a country by country basis. Exploring a WW
application/interpretation of law is extremely expensive and that is why
when an org settles on a license it sticks to it to keep it simple and
avoid risk of overlooking something in the process and get in big trouble
somewhere.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
I seem to be agreeing with you re sticking to Boost licence. Only
suggesting what can be done to allow the library to do that. The concepts I
described apply in many jurisdictions (I don't presume to know all of them).

However for certitude the options are more limited.

One is to "clean room" functionally equivalent table/code given
information/primary sources needed to do so (i.e. the spec rather than any
machine usable form of it). This is potentially impractical to do and
maintain (including testing that all is correct and consistent with unicode
spec).

Another is to NOT include the unicode.org headers/data in the lib itself
but to have a compile time dependency on them, some doc's re downloading or
referencing them where (potentially) already installed on a system, and ref
to the unicode licence, which each user is free to assess for themselves.
This approach avoids the need to make the (probably invalid in most
jurisdictions) "act of download is unequivocal acceptance" licence text
part of a boost distribution and avoiding a need to build equivalent data
independently or argue about licence validity, copyright of tabulated data
necessary to conform to a spec etc.. Each Boost user can make that call
without involving Boost. For this potential user at least, that is fine.

Are there any other options? A lib that has to deliver conformance with a
spec that has licence terms should not be rejected due to its dependency on
that spec. Neither should depending on that spec cause any Boost code,
doc's etc to have a different licence. I'm out of suggestions for achieving
that.


On Sat, 20 Jun 2020, 12:37 am Damian Vicino, <[hidden email]> wrote:

>
>
> El vie., 19 jun. 2020 a las 0:45, Darryl Green via Boost (<
> [hidden email]>) escribió:
>
>> I also am not a lawyer but know from experience that you can't copyright
>> data, or simple expressions of it. A program is an original work. A table
>> of facts it uses is not. Someone can write what they like in a licence but
>> if the table can be (ideally, has been) produced manually or
>> algorithmically from information (published in whatever form, even if
>> copyright) that simply conveys facts you can use it. There is some
>> risk/issue if you use the way the facts are arranged "creatively" e.g. as
>> a
>> graph or figure or if included in a database (that is organised
>> "creatively" to allow use/retrieval) etc.. Also copyright does not protect
>> an underlying algorithm - only the program that implements it. And if
>> there
>> is simply/only one obvious way to represent said algorithm etc it doesn't
>> prevent one writing the same thing... Obviously a whole program composed
>> of
>> fragments that aren't individually copyrightable but "creatively arranged"
>> in some original program can't then be arranged into a "new" program to
>> form a whole that is - effectively - the original work without that
>> copying
>> of creative arrangement being a breach of copyright ...
>>
>> The problem is that what is copyright, what is copyrightable, and what is
> legal or illegal is defined in a country by country basis. Exploring a WW
> application/interpretation of law is extremely expensive and that is why
> when an org settles on a license it sticks to it to keep it simple and
> avoid risk of overlooking something in the process and get in big trouble
> somewhere.
>

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 19/06/2020 03:01, degski via Boost wrote:

> On Thu, 18 Jun 2020 at 13:22, Phil Endecott via Boost <[hidden email]>
> wrote:
>
>> I have had a look at the trie; I have sometimes wondered to what
>> extent a trie can be implemented in C++ that works as much as
>> possible like a standard container so this is interesting.
>>
>
> https://github.com/ned14/nedtries : no need to start from scratch, I have
> seen the author's name being mentioned here and there, one in a while.
> Explosive development without reflection, design and consolidation, amounts
> to anarchy.

Alas nedtries is a *bitwise* trie implementation, not a character based
one. A bitwise trie indexing algorithm is superb for associative
containers where the key is a bunch of bits small enough to allow the
CPU's bit scan reverse opcode to work on them (i.e. as an unordered_map
or map replacement), but I'm not sure that that is directly useful for
UTF-8 parsing.

Niall

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Thu, Jun 18, 2020 at 1:25 PM Vinnie Falco via Boost
<[hidden email]> wrote:
>
> On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost
> <[hidden email]> wrote:
> > if I understand correctly, Zach's code includes tables that are
> > derived from Unicode data and the licence for that data requires
> > a copyright attribution.
>
> Could we ask the owner of the copyright to dual-license the data under
> the BSL? That might be a cheap and easy solution.

Maybe?  I'll email the one ICU developer I've ever spoken to, and see
if he can direct me to the right person to ask about this.

Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
On Sun, 21 Jun 2020, 7:42 am Zach Laine via Boost,
<[hidden email]> wrote:

>
> On Thu, Jun 18, 2020 at 1:25 PM Vinnie Falco via Boost
> <[hidden email]> wrote:
> >
> > On Thu, Jun 18, 2020 at 11:22 AM Phil Endecott via Boost
> > <[hidden email]> wrote:
> > > if I understand correctly, Zach's code includes tables that are
> > > derived from Unicode data and the licence for that data requires
> > > a copyright attribution.
> >
> > Could we ask the owner of the copyright to dual-license the data under
> > the BSL? That might be a cheap and easy solution.
>
> Maybe?  I'll email the one ICU developer I've ever spoken to, and see
> if he can direct me to the right person to ask about this.

The use that was made of the .txt data files to extract the
information needed to produce headers that are part of Boost.Text is
in my opinion fine. Boost.Text users never use or need access to these
.txt files.

If one takes the contrary view that the licence of these files is
something that the user of Boost.Text needs to care about, then be
aware that it is unlikely any single person or entity can re-license
these files as I believe they may have various contributors/licences.
I would try to avoid pulling on that string for fear of the tangle
that is likely at the end of it.

That said, it might be possible to ask ICU for explicit authorisation
(to the extent they alone can give it) for the use of the .txt files
as a source of constant data required to process unicode in accordance
with the standard without requiring that headers in Boost.Text that
embed those constants, for that purpose, be licenced under the ICU
licence. One reasonably expects that was their intent in producing
these files, but it doesn't hurt to ask.

With or without that confirmation it would be good practice to cite
the source of the DATA values (not code) in Boost.Text documentation
and/or in the headers that embed the constants.

>
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Thu, Jun 18, 2020 at 1:22 PM Phil Endecott via Boost
<[hidden email]> wrote:
>
> This is my review of Zach Laine's proposed Boost.Text.

Thanks for the review, Phil.

> Zach is offering several things here: Unicode, a "better string",
> a trie, and more. I'm pleased that these components are all
> individually exposed to users and are independent of each other,
> rather than some being hidden as implementation details.
>
> I've spent a couple of days looking at a few parts of the library.
> I'll discuss at the Unicode stuff first.
>
> Boost gained support for Unicode functionality when Boost.Locale
> was accepted almost a decade ago, but I was not particularly
> impressed by that library and have not used it in my own work.
> Does the current proposal do better? I think it does, and
> though it's not perfect I think it would be worth accepting but
> subject to a concern about copyright.
>
> See another thread for the copyright discussion. Basically,
> if I understand correctly, Zach's code includes tables that are
> derived from Unicode data and the licence for that data requires
> a copyright attribution. The Boost licence does not require
> attribution and I support that. So if I've understood the
> situation correctly - and I am not a lawyer! - this means that
> sadly Boost must REJECT the affected code.

Agreed.

> I'll say some more about the other aspects of the Unicode code,
> just in case the copyright issue turns out to be surmountable.

Me too.  I've sent an email to someone at ICU, as alluded to earlier
in this thread.

> I've tried just a couple of things. Firstly, I tried the UTF-8
> to UTF-32 conversion. This seems to work, and the API is
> reasonable. Performance is not bad, but it could be better
> (see another thread); that could be fixed later. I think it
> would be worth having just a UTF library in Boost!

Could you provide the exact scenario you tried, and point me to the
conversion technique that you used that was faster?

> Secondly I looked at the line-splitting algorithm, i.e.
>
> template<typename CPIter, typename Sentinel, typename Extent,
>          typename CPExtentFunc>
>   unspecified lines(CPIter first, Sentinel last, Extent max_extent,
>                     CPExtentFunc cp_extent, bool break_overlong_lines = true);
>
> Again this seems to work, and the API is reasonable. Reference
> documentation is somewhat lacking though; in the description of
> lines(), the requirements for CPExtentFunc need to be described,
> and the "unspecified" return type needs to document that it includes
> members including iter and hard_line_break. I think there are
> probably similar omissions in much of the reference documentation.

Thanks!  I think all those omissions in the reference docs are covered
by the tutorial docs, but yeah, I definitely need to add explanations
there.

> I'm unconvinced by the use of the word "sentinel" here. A sentinel
> is a value with a special meaning, e.g. a terminating null; what
> you have here is an end-iterator. It's good that you're allowing
> the begin and end iterators to have different types, but I think
> you should name that type something like CPEndIter.

This is standard terminology, as of C++20.  See
http://eel.is/c++draft/ranges for numerous uses.  It means
end-iterator-or-other-type-that-acts-as-the-end-tag.

> The implementation of line splitting seems to call its word
> length callback more often than necessary, i.e. when a word
> overflows a line, that word is subsequently measured a second
> time. That should be easily fixable.

Thanks, I'll revisit that.

> Moving on to other parts of the proposal, I'm less happy with
> text::string. The first thing I noticed was the very strange
> decision to use int, rather than (signed?) size_t, for indexing;
> yes I do work with chunks of text larger than 2 GB! I would
> REJECT text::string for this alone.

Sure.  You're in good company.

> I also dislike the use of operator() to extract substrings; the
> meaning just isn't obvious:
>
> auto x = y(4);

I think this is true because of lack of familiarity.  Seeing this a
few times would make it familiar pretty quickly.  That being said, I'd
prefer if I could write y[0:4] or y[0,4] but I can't, at least not
without using some kind of strong typedef and a UDL, like
y[0_something,4].  Anyway, the string types are moot at this point.

> Is that the 4th character of y? Or the substring from 4 to the
> end? No, it's the substring from the start to 3. I'd need to
> add a comment to explain what it's doing. But actually there's
> another problem with that line: x is a non-owning view of that
> substring. I think that methods that return a view (or span,
> etc.) should have a name that makes that clear, e.g. "subview".
> (This wouldn't have been a problem before auto.)  This came up in
> the static_string review, where the substr() method was originally
> proposed to return a view; that was renamed subview() with
> substr() returning a copied string.

Again, this is a question of familiarity.  If std::string did not have
substr() (as text::string does not) there would not be confusion.
Silent memory allocations (substr() semantics) are a larger footgun
than dangling references (op() semantics) in the age of ASan, IMO.

> I don't mind that the string isn't templated on the char type,
> but I do miss the allocator support; I've used strings with arena
> allocators (for performance) and with shared memory allocators.
> I know allocator support is a lot of work but it is necessary, IMO.

I disagree.  Allocators are a poor design, in that they are too
general.  Consider std::vector.  An insert operation has different
optimal implementations depending on whether the vector is a fixed
capacity, inline-storage type or a heap-allocated type.  An allocator
changed the allocation strategy, but leaves all the other code that
often depends on the allocation semantics unchanged.  A better design
is to have a separate templates for inline storage (e.g.
boost::container::static_vector), sbo-optimized heap storage (e.g.
boost::container::small_vector), and heap storage (e.g. std::vector).
Many other libraries have come to the same conclusion, including FB
Folly, LLVM's internal lib, and more.

> text::text seems to be the obvious combination of text::string
> with unicode functionality, and I've not looked at that.

If you have time, please take a look.  text::text is almost certainly
soon to be reimplemented in terms of std::string.

> I have had a look at the trie; I have sometimes wondered to what
> extent a trie can be implemented in C++ that works as much as
> possible like a standard container so this is interesting.
>
> The docs should mention briefly why trie is here, i.e. where it
> is used, just in anticipation of that obvious question from users.
>
> trie::operator[] returns an "optional ref". But it's not like
> a std::optional; as well as operator* it has conversion operators
> to access the value. Zach notes that this is "wonky" when the
> value type is bool, but it also seems to break for me when it's
> int:
>
> boost::trie::trie_map<std::string, std::string> counts;
> auto x = counts["Hello"];
> if (x) std::cout << "present\n"; else std::cout << "absent\n";
> // Prints absent, OK.
>
> boost::trie::trie_map<std::string, int> int_counts;  // Now int.
> auto y = int_counts["Hello"];
> if (y) std::cout << "present\n"; else std::cout << "absent\n";
>
> // ../text/include/boost/text/trie.hpp:78:
> // boost::trie::optional_ref<T, Const>::operator T&() &
> // [with T = int; bool Const = false]: Assertion `t_' failed.

Thanks!  I was missing an operator bool() overload for *mutable*
lvalue ref.  It works when I add that.

> operator T&() is being called by if(y), rather than
> operator bool(). There aren't has_value() and value() methods
> that I can use to work around this.
>
> Like a few other things in this proposal, Zach has implemented
> trie::operator[] how he wishes the standard library worked.
> As another example, dereferencing a trie_map iterator doesn't
> return a pair<KEY,VALUE> as a normal associative container
> would, but rather a struct with members called key and value.
>
> At worst, this will hit problems that the standard solutions
> avoid (as above). At best, it will make the library more
> difficult to learn for users who are familiar with the standard
> library (including its problems),

I think Boost should be a place to experiment with approaches to
addressing those problems, rather than propagating them.  I find
.first and .second to be horrible names for the key-value pair used to
represent the mapping of a std::map.  I don't think it's appropriate
for new code.  Code that uses structured bindings should not be
affected.

> and make generic code less
> likely to work with it.
>
> As a test I wrote a simple program to count word frequencies:
>
> using counts_t = std::map<std::string,int>;
> counts_t counts;
>
> while (std::cin.good()) {
>   std::string s;
>   std::cin >> s;
>   counts[s]++;
> }
>
> for (auto i: counts) {
>   std::cout << i.first << " = " << i.second << "\n";
> }
>
> It would be great if I could just change it to:
>
> using counts_t = boost::trie_map<std::string,int>;
>
> but I can't do that; I need to change things:
>
> struct count { int value = 0; };  // Wrap the int.
> boost::trie::trie_map<std::string, count> counts;
>
> while (std::cin.good()) {
>   std::string s;
>   std::cin >> s;
>
>   // operator[] doesn't insert, so:
>   auto r = counts[s];
>   if (r) r->value++;
>   else counts.insert(s,count{1});
> }
>
> for (auto i: counts) {
>   std::cout << i.key << " = " << i.value.value << "\n";
> }
>
> That works; how about performance? Trie should perform
> well when the stored strings frequently share common prefixes,
> so I've tested with a large web server log file; this has
> timestamps and URLs which share common prefixes. The results:
>
> std::map: run time 1.2 seconds, memory 1007 pages
> trie_map: run time 1.8 seconds, memory 4868 pages
>
> So that's rather disappointing.

First, it's important to note that frequent common prefixes don't
necessarily make the trie more lookup-efficient.  If you are looking
for "fred" and you actually have a trie containing f -> r- > e -> d as
nodes, you'll visit all 4 nodes as you look up "fred".  For a string
of 10 chars, you'll visit all 10 nodes for a match, etc.  For an R/B
tree like std::map, you'll only visit log(N) nodes, where N is the
size() of the map.  This is unlikely to be nearly as large as the
longest string for many data sets.

The docs explicitly state that the trie_map and trie_set templates are
there for convenience, and that if you care about performance you
should use trie.  Moreover, if you're trying to eat the longest token
that matches some list of tokens encoded in a trie, the trie will be a
lot faster, since you don't have to redo a log(N) search M times,
where M is the length of the token you're eating.

If you're using a trie outside of its niche, you should not expect
ideal outcomes.  You should use the thing that actually delivers ideal
outcomes in that case you care about.

> Some general comments:
>
> Some people care about whether libraries are header-only or not,
> and the docs should mention this, and possibly say which
> functionality can be expected to work header-only.
>
> I got quite a few missing field initializer and unused parameter
> warnings; please fix these if at all possible. (gcc 6.3, -W -Wall).
> I think you should re-order the docs so that the "Hello World"
> example is much earlier.
>
> Some functions clearly need to embed large Unicode tables; it
> would be useful if the docs mentioned roughly how much the
> executable size is increased by.
>
> Some source files are executable.
>
> Conclusions:
>
> I fear that the only part of this that Boost should accept is
> the UTF transcoding. The implementation is not perfect but we
> could work on that. The other Unicode stuff would be good to
> have but the copyright issue is a blocker.


Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
On Mon, 22 Jun 2020 at 09:32, Zach Laine via Boost
<[hidden email]> wrote:

> > I don't mind that the string isn't templated on the char type,
> > but I do miss the allocator support; I've used strings with arena
> > allocators (for performance) and with shared memory allocators.
> > I know allocator support is a lot of work but it is necessary, IMO.
>
> I disagree.  Allocators are a poor design, in that they are too
> general.  Consider std::vector.  An insert operation has different
> optimal implementations depending on whether the vector is a fixed
> capacity, inline-storage type or a heap-allocated type.  An allocator
> changed the allocation strategy, but leaves all the other code that
> often depends on the allocation semantics unchanged.  A better design
> is to have a separate templates for inline storage (e.g.
> boost::container::static_vector), sbo-optimized heap storage (e.g.
> boost::container::small_vector), and heap storage (e.g. std::vector).
> Many other libraries have come to the same conclusion, including FB
> Folly, LLVM's internal lib, and more.

I don't claim to understand the details of insertion for different-type vectors,
but the container<->allocator interface certainly cannot really communicate
a whole lotta things that a container might want to tell the allocator, like
"I'm going to be replacing this old allocation with this new allocation", and
"I'm going to be allocating this many elements with individual allocations,
but I'm not seeking to allocate an array". Such things can be communicated
straight-to-the-allocator in an allocator-specific way outside the container's
control, but there's no generic interface for that sort of hinting, which has
always seemed quite funny to me, considering that memory allocation
is the allocator's responsibility, but things that might have an immense
impact on the proper allocation strategy that should be selected are not hinted
to the allocator. And they're not necessarily static.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Sun, Jun 21, 2020 at 11:32 PM Zach Laine via Boost
<[hidden email]> wrote:
> Allocators are a poor design, in that they are too general.

I agree that containers could do better than the general Allocator
interface. But it does not then follow that making a container capable
of using only the default allocator is an improvement.

Regards

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Hi,

Il 18/06/20 20:21, Phil Endecott via Boost ha scritto:
> See another thread for the copyright discussion. Basically,
> if I understand correctly, Zach's code includes tables that are
> derived from Unicode data and the licence for that data requires
> a copyright attribution. The Boost licence does not require
> attribution and I support that. So if I've understood the
> situation correctly - and I am not a lawyer! - this means that
> sadly Boost must REJECT the affected code.

Are these tables the same that are embedded in Boost.Spirit headers? I
am referring to files in
libs/spirit/include/boost/spirit/home/support/char_encoding/unicode. If
those files are problematic, then probably something should be done
about Boost.Spirit as well.

Also, it seems those files are more than 10 years old now. They should
probably be updated, or, even better, not distributed at all with Boost
(and the headers should be regenerated at compile time starting from
updated Unicode description files, possibly those already available with
the operating system when it is the case).

Giovanni.
--
Giovanni Mascellani <[hidden email]>
Postdoc researcher - Université Libre de Bruxelles



_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

signature.asc (235 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
Giovanni Mascellani wrote:

> Hi,
>
> Il 18/06/20 20:21, Phil Endecott via Boost ha scritto:
>> See another thread for the copyright discussion. Basically,
>> if I understand correctly, Zach's code includes tables that are
>> derived from Unicode data and the licence for that data requires
>> a copyright attribution. The Boost licence does not require
>> attribution and I support that. So if I've understood the
>> situation correctly - and I am not a lawyer! - this means that
>> sadly Boost must REJECT the affected code.
>
> Are these tables the same that are embedded in Boost.Spirit headers? I
> am referring to files in
> libs/spirit/include/boost/spirit/home/support/char_encoding/unicode. If
> those files are problematic, then probably something should be done
> about Boost.Spirit as well.

Well spotted.  See also:

https://github.com/boostorg/spirit/blob/develop/workbench/unicode/DerivedCoreProperties.txt

That starts:

# DerivedCoreProperties-5.2.0.txt
# Date: 2009-08-26, 00:45:22 GMT [MD]
#
# Unicode Character Database
# Copyright (c) 1991-2009 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
# For documentation, see http://www.unicode.org/reports/tr44/

# It is ok to redistribute this file "solely for informational
# purposes in the creation of products supporting the Unicode Standard".
# We don't nee to add a Boost License to this file: boostinspect:nolicense.


Note that I don't know who wrote that, nor where the "solely for
informational purposes..." quote comes from.  And what are
"informational purposes" anyway? - it doesn't sound to me as
if that were written by a lawyer.

This really needs an expert to look at it, IMO.


Regards, Phil.





_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Boost.Text review

Boost - Dev mailing list
On 26/06/2020 00:07, Phil Endecott wrote:

> # It is ok to redistribute this file "solely for informational
> # purposes in the creation of products supporting the Unicode Standard".
> # We don't nee to add a Boost License to this file: boostinspect:nolicense.
>
> Note that I don't know who wrote that, nor where the "solely for
> informational purposes..." quote comes from.  And what are
> "informational purposes" anyway? - it doesn't sound to me as
> if that were written by a lawyer.
>
> This really needs an expert to look at it, IMO.

That text appears at https://spdx.org/licenses/Unicode-TOU.html

Although that is perhaps an older version of the Unicode license, as it
doesn't match what is at https://www.unicode.org/license.html (or at
http://www.unicode.org/copyright.html)

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost