[histogram] Formal review

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[histogram] Formal review

Boost - Dev mailing list
Boost.Histogram contains useful and well-designed features, like excess
(over/underflow) bins, and axis transforms to name a few.

As this is a review, I am going to spent most time below on critical
scrutiny.


I. DESIGN
---------

Iterators are const. This prevents us from bootstrapping the histogram
with a prior distribution using STL algorithms like std::fill(),
std::generate(), or std::sample().

The adaptive_storage does not work well with STL algorithms, because
weight_counter is not an arithmetic type. See the Implementation section
below for more detail. I therefore propose that the default storage
policy is changed to something without weight_counter.

The adaptive_storage has two responsibilities: data compaction and
weights. Would it be possible to split this into two separate storage
policies? I have no real use for arrival variance.

I am not too fond of using operator() for insertion. The code looks like
a function call with side-effect. I prefer an explicitly named member
function.

The axis::ouflow_type is oddly named. I suggest that this is renamed to
something like axis::excess_type.


II. IMPLEMENTATION
------------------

The implementation is generally of high quality. However, I did
encounter the three problem areas listed below.

(A) Integration with STL algorithms can be improved. Here are some
examples:

First, std::distance() does not work on a histogram with array_storage.
This means that other STL algorithms, like std::any_of(), fails to
compile. I solved this problem by copying the distance_to() function
from the axis iterator to the histogram iterator.

Second, std::max_element (and brethren) cannot be used on a histogram.
Consider the following example:

   auto element = std::max_element(h.begin(), h.end());

Compilation fails for adaptive_storage because weight_counter has no
less-than operator. Furthermore, compilation fails for array_storage
because iterator_over<H>::operator= fails. It attempts to re-assign
a reference, but accidentally triggers a copy-constructor instead.
Letting the iterator store a pointer instead of a reference solves the
problem.

Apropos, iterator_over<H>::operator= does not return *this. Consider
adding the -Werror=return-type compiler flag and a unit-test to that
calls this operator.

Third, std::inner_product fails to compile for adaptive_storage because
weight_counter does not have a binary operator*. The inner product of
two histograms is useful for calculating the cosine similarity, which
can be used to compare two distributions. std::inner_product works for
array_storage though.

(B) Indexing and size use different types (int versus std::size_t.) I
assume that this is because of the underflow bin, which is indexed by
-1. I am not certain whether or not this is a real problem, but it does
cause some oddities when stressed to the limit. For example, if we need
a histogram that counts each unsigned int, then we get an "lower <
upper" exception during construction:

   using range_type = unsigned int;
   auto h = make_static_histogram_with(
     array_storage<int>(),
     axis::integer<>(0, std::numeric_limits<range_type>::max()));

It works if we reduce the end by 2, but then we are not collecting all
values (unless we shift the range left by 1 and misuse the excess bins
for the two missing values.)

A possible solution could be to replace the -1 and N excess indices with
an "enum struct excess { lower, upper }" and let operator[] et al
use overloading on this enum.

(C) Sometimes the compilation errors are nonsensical. For example:

   // Forgot to use make_static_histogram_with()
   auto h = make_static_histogram(array_storage<int>,
                                  axis::regular<>(10, 0, 1));

triggers an incomprehensible static_assert plus an error about a missing
.shape() function.


III. DOCUMENTATION
------------------

User guide is clear and pedagogical.

I would like to see more examples that uses STL algorithms, such as
calculating the CDF using std::partial_sum(), or calculating the cosine
similarity of two histograms using std::inner_product().

Most examples use std::cout plus a comment to document the results of
operations. Consider using assert() instead.

Consider using a tabular layout similar to that of the C++ standard (or
cppreference.com for that matter) on the Concepts page.

Reference documentation is rather meager and shows implementation
details.

The documentation contains both a "Reference" and a "References" chapter
which are completely different. Consider renaming the latter to
something like "External references" , "Literature references", or
"Bibliography".


IV. MISC
--------

I have spent around 15 hours on the review, mainly writing small
examples that uses STL algorithms on Boost.Histogram.

I am well-versed in the topic. I work with statistical distributions
for real-time analysis of data, although I mainly dabble around in one
dimension. I have written a library for online/streaming statistical
algorithms.


V. VERDICT
----------

While Boost.Histogram is a small niche library, it has a wide range of
practical applications to warrant the inclusion into Boost.

Boost.Histogram should be ACCEPTED into Boost, provided:

   1. Reference documentation is finalized.

I furthermore strongly recommend that the default storage policy is
changed to something without weight_counter because of the various
problems with STL algorithms. This recommendation is not a prerequisite
for acceptance.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Formal review

Boost - Dev mailing list
Dear Bjørn,

> On 23. Sep 2018, at 16:26, Bjorn Reese via Boost <[hidden email]> wrote:
>
> Boost.Histogram contains useful and well-designed features, like excess
> (over/underflow) bins, and axis transforms to name a few.
>
> As this is a review, I am going to spent most time below on critical
> scrutiny. […]

thank you very much for the great review and the time spend with the library. Some discussion below.

> I. DESIGN
> ---------
>
> Iterators are const. This prevents us from bootstrapping the histogram
> with a prior distribution using STL algorithms like std::fill(),
> std::generate(), or std::sample().

Writable iterators could be added, but I am not convinced that is a good idea.

The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.

> The adaptive_storage does not work well with STL algorithms, because
> weight_counter is not an arithmetic type. See the Implementation section
> below for more detail. I therefore propose that the default storage
> policy is changed to something without weight_counter.

There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.

> The adaptive_storage has two responsibilities: data compaction and
> weights. Would it be possible to split this into two separate storage
> policies? I have no real use for arrival variance.

Agreed, it should be a compile-time option for adaptive_storage.
https://github.com/HDembinski/histogram/issues/75

> I am not too fond of using operator() for insertion. The code looks like
> a function call with side-effect. I prefer an explicitly named member
> function.

Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator():
- a 1D histogram can be filled with std::for_each or std::accumulate
- consistency with Boost.Accumulators, which also use operator()

> The axis::ouflow_type is oddly named. I suggest that this is renamed to
> something like axis::excess_type.

I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?

> II. IMPLEMENTATION
> —————————
> […]

Agreed, the missing bits will be implemented and unit-tested.

> III. DOCUMENTATION
> —————————
> […]

Agreed, documentation will be expanded in this way.

> Boost.Histogram should be ACCEPTED into Boost, provided:
>
>  1. Reference documentation is finalized.
>
> I furthermore strongly recommend that the default storage policy is
> changed to something without weight_counter because of the various
> problems with STL algorithms. This recommendation is not a prerequisite
> for acceptance.

Thanks!


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Formal review

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 09/24/18 23:52, Hans Dembinski wrote:

> The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.

If you are using histograms to collect data for later analysis, then
there may indeed not be much use for a prior distribution. However, if
you use histograms to make decisions on-the-fly, then you cannot always
wait for the histogram to be filled up with data in order to make
reliable decisions. In such cases a prior distribution can help.

Admittedly, I have not done this with histograms, but I do use CDFs
(well, quantile approximations) quite a lot for on-the-fly decision-
making.

The other use case is to restore the histogram with data from an
earlier run, but serialization already covers that.

> There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.

True, but then I am going to play the zero-overhead card :)

> Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator():
> - a 1D histogram can be filled with std::for_each or std::accumulate
> - consistency with Boost.Accumulators, which also use operator()

Inserting data into the histogram with std::accumulate is odd.

An alternative is to use std::copy (or std::transform) with
std::inserter. That does requires a histogram::insert(iterator hint,
value_type value) where the hint is completely useless... but that
is what std::set is using.

> I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?

Seems like you settled on options in another thread. I am fine with
that.

Related to this, I also like the extent() over shape(). This name is
also used by Boost.MultiArray and the std::mdspan proposal (P0009R7).
In the name of consistency, you may also consider renaming dim() to
rank().

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Formal review

Boost - Dev mailing list

> On 30. Sep 2018, at 17:29, Bjorn Reese via Boost <[hidden email]> wrote:
>
> On 09/24/18 23:52, Hans Dembinski wrote:
>
>> The histogram iterators are const, because I cannot imagine a use-case where the histogram has to be bootstrapped with a prior distribution. Only providing const iterators avoids the misconception that a histogram is a container. Histogram can make stronger guarantees about its internal consistency, if users cannot set bin counters arbitrarily.
>
> If you are using histograms to collect data for later analysis, then
> there may indeed not be much use for a prior distribution. However, if
> you use histograms to make decisions on-the-fly, then you cannot always
> wait for the histogram to be filled up with data in order to make
> reliable decisions. In such cases a prior distribution can help.
>
> Admittedly, I have not done this with histograms, but I do use CDFs
> (well, quantile approximations) quite a lot for on-the-fly decision-
> making.
>
> The other use case is to restore the histogram with data from an
> earlier run, but serialization already covers that.

I plan to add a struct `unsafe_access` which allows unsafe raw access to the bin counters, like so

auto& storage = unsafe_access::storage(some_histogram);

This will be part of the public interface, but is targeted towards advanced users. It will be used by algorithms that transform histograms and by serialization code. I also want to support other serialization libraries than Boost.Serialization, and unsafe_access makes this easier. With `unsafe_access`, advanced users get full control, but the standard interface still protects from trivial usage errors. The docs of `unsafe_access` then basically say: if you use this, you are on your own. It could be used to set the histogram with a prior, which seems like an advanced use case.

>> There are two solutions, right? Turing weight_counter into a full arithmetic type or removing support for weight_counter in the default storage.
>
> True, but then I am going to play the zero-overhead card :)

Fair enough.
https://github.com/HDembinski/histogram/issues/75

The default value would then be without weight support.

>> Originally, I had a special method called fill(), then I switched to operator(). There are two arguments in favour of operator():
>> - a 1D histogram can be filled with std::for_each or std::accumulate
>> - consistency with Boost.Accumulators, which also use operator()
>
> Inserting data into the histogram with std::accumulate is odd.

I was wrong, std::accumulate uses += to fill, which is not supported.

> An alternative is to use std::copy (or std::transform) with
> std::inserter. That does requires a histogram::insert(iterator hint,
> value_type value) where the hint is completely useless... but that
> is what std::set is using.

histogram is not a container, so it should not support std::inserter. histogram is a accumulator and accumulators have a functor design, as established by Boost.Accumulator. They eat samples and update their internal state, there is an implied transformative aspect. Inserting a value implies that you can exactly retrieve it later, which is not the case here.

>> I am ok with renaming, but I think we haven't found the best name yet. What do you think about axis::tails_type?
>
> Seems like you settled on options in another thread. I am fine with
> that.
>
> Related to this, I also like the extent() over shape(). This name is
> also used by Boost.MultiArray and the std::mdspan proposal (P0009R7).
> In the name of consistency, you may also consider renaming dim() to
> rank().

Good suggestion, thank you!
https://github.com/HDembinski/histogram/issues/112

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost