[histogram] Working with standard algorithms

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[histogram] Working with standard algorithms

Boost - Dev mailing list
I would like to see a better integration between boost.histogram and
standard algorithms, especially those from <numeric>. Today they do not
work together because the dereference operator of the axis iterator
returns a pair (index and content) rather than the content.

An example of where standard algorithms could be useful is calculating
the cumulative distribution of a one-dimensional histogram, which ought
to be a simple as:

   auto& axis = h.axis();
   std:partial_sum(axis.begin(), axis.end(), axis.begin());

Another example is ranking a set of histograms based on the cosine
similarity [1] using std::inner_product().

[1] https://en.wikipedia.org/wiki/Cosine_similarity

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Working with standard algorithms

Boost - Dev mailing list
Dear Bjørn,

> On 18. Nov 2017, at 17:08, Bjorn Reese via Boost <[hidden email]> wrote:
>
> I would like to see a better integration between boost.histogram and
> standard algorithms, especially those from <numeric>. Today they do not
> work together because the dereference operator of the axis iterator
> returns a pair (index and content) rather than the content.
>
> An example of where standard algorithms could be useful is calculating
> the cumulative distribution of a one-dimensional histogram, which ought
> to be a simple as:
>
>  auto& axis = h.axis();
>  std:partial_sum(axis.begin(), axis.end(), axis.begin());
>
> Another example is ranking a set of histograms based on the cosine
> similarity [1] using std::inner_product().

I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?

I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed.

How should we define such iterators so that they generalise to multiple dimensions? For example, it would be easy to implement new iterators which just iterates over the bin content:

auto h = make_static_histogram(axis::regular(3, 0, 1));
// fill […], then
std::vector<double> cumulative;
std::partial_sum(h.begin(), h.end(), cumulative.begin(), std::back_inserter(cumulative));

Questions:
- What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?

The current design keeps the internal order an implementation detail (I use column-major internally, because it is easier to compute the strides on the fly), the two choices are fortran-like (column-major) or C-like (row-major). Perhaps provide both with clear names so it is obvious? Or just row-major, because that's the standard in C++? These questions must have been already discussed during the design of Boost.MultiArray. I think I should provide a way to provide a MultiArray copy of the bin content, thus leveraging all the work that already went into defining MultiArray. Similarly one could provide a read-only view of the variance estimates. A read-only MultiArray view of the internal bin counts of the histogram is also possible for array_storage policy, but not for the default adaptive_storage policy.

- Writable iterators?
In your example, you use writable iterators, to change the content (and the meaning of the content) of the histogram itself. In my reply here, I use read-only iterators, because changing the content of the histogram seems to go against the semantics. A histogram is more specific than a general container of values, the content has a strong meaning that cannot be arbitrarily changed. To be precise, what should the variance estimate, returned by `histogram::variance(…)` be for a cumulative count? The variance estimate for cumulative count is a whole matrix of covariances, since the successive entries are not independent. There are other operations where it is not clearly defined how the variance estimate should be computed. I think the best way to avoid these problems is to provide read-only iterators.

Let me know what you think. I also added your message as an issue on github.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Working with standard algorithms

Boost - Dev mailing list

> On 20. Nov 2017, at 12:51, Hans Dembinski <[hidden email]> wrote:
>
> std::partial_sum(h.begin(), h.end(), cumulative.begin(), std::back_inserter(cumulative));

Whoops, that's an incomplete edit… I meant this.

std::partial_sum(h.begin(), h.end(), std::back_inserter(cumulative));

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Working with standard algorithms

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 11/20/2017 12:51 PM, Hans Dembinski wrote:

> I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?

A working example using cosine similarity is too large to include here,
but the crux of it is to calculate the cosine similarity between a
histogram h1 and all other histograms in order to determine which of
these other histograms is closest to h1. Anyways, it was just an
example of where it makes sense to use std::inner_product() on two
histograms.

> I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed.
I would rename the current indexing iterator so something like
bin_iterator because it does something special:

   using bin_iterator = unspecified;
   bin_iterator bin_begin() const;
   bin_iterator bin_end() const;

The iterator (and reverse_iterator) types can then be used to traverse
the content.

> - What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?

I would let axis iterators traverse only along the axis. If we need
something to traverse all elements within a multi-dimensional array,
then we probably need something else, such as mdspan [1].

[1] https://github.com/kokkos/array_ref/blob/master/proposals/P0009.rst

> - Writable iterators?

Good argument for const iterators (variance calculation.)

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] Working with standard algorithms

Boost - Dev mailing list

> On 26. Nov 2017, at 12:51, Bjorn Reese via Boost <[hidden email]> wrote:
>
> On 11/20/2017 12:51 PM, Hans Dembinski wrote:
>
>> I totally see the point for partial_sum. Could you also provide a code example for how you would like to compute the cosine similarity?
>
> A working example using cosine similarity is too large to include here,
> but the crux of it is to calculate the cosine similarity between a
> histogram h1 and all other histograms in order to determine which of
> these other histograms is closest to h1. Anyways, it was just an
> example of where it makes sense to use std::inner_product() on two
> histograms.

Ok, I just wanted to know whether this example would require the same kind of interface.

I committed a change yesterday which implements value iterators for the histograms. This allows code like this:

auto h = /* make a histogram */
std::vector<double> csum;
std::partial_sum(h.begin(), h.end(), std::back_inserter(csum));

The forward iterator returned here iterates over the values of the histogram in implementation-defined order, skipping under-/overflow bins. For 1d-histograms, this is guaranteed to be the normal order you would expect from any 1d container.

This code works well for 1d-histograms. So what if I have 2d? The forward iterator can tell you the current 2d index of the value it points to. To fill a boost::multi_array, you would do

auto h = /* make 2d histogram, 2x3 */
boost::multi_array<double, 2> a(boost::extends[2][3]);
for (auto iter = h.begin(); iter != h.end(); ++iter)
        a[iter.idx(0)][iter.idx(1)] = *iter;

This iterator is modelled after numpy.nditer in Python.

This approach has pros and cons:
+ simplest solution for users of 1d-histograms who do not care about variances and under-/overflow bins
- still hand-written loop required for multi-dimensional histograms and to access both values and variances

I originally wanted to provide a conversion to const_multi_array_ref, which would solve the iteration problem by connecting histogram to an established multi-dimensional array interface. I looked into the reference and I don't see how it can work, because const_multi_array_ref seems to require raw memory access. I cannot easily hand out a raw pointer to the internal counter memory, because the default storage handles this memory dynamically and switches the internal type used to store the counts as needed. If the type is not known at compile time, I cannot provide a raw pointer with the appropriate type to const_multi_array_ref. It may be possible to use a fancy pointer here (the template allows to specify the pointer type), which does the conversion into a fixed value type upon dereferencing, but I still see conflicts with the multi_array interface, which provides raw access to its internal memory with methods data() and origin().

However, I could convert the internal counters to double when the conversion to const_multi_array_ref is called. This would allow me to hand out a raw pointer to memory. For storage policies which use a fixed counter type, no conversion would be necessary, so that would work as well.

>> I think we need several iterators then. The current axis iterators iterate over the bins definition, not really over the bin content of the histogram itself. I chose the current iterators in this way, because I believe this scheme generalises well to the multi-dimensional case. I wouldn't replace these iterators, because they seem useful to me, but additional iterators that iterate over the bin values and play well with algorithms are also clearly needed.
> I would rename the current indexing iterator so something like
> bin_iterator because it does something special:
>
>  using bin_iterator = unspecified;
>  bin_iterator bin_begin() const;
>  bin_iterator bin_end() const;

I suppose you suggest this to provide more clarity on what the axis iterators iterate over, but this solution has some drawbacks.

The philosophy in this library is to regard an axis as a logical collection of bins (the individual bins do not really exist physically as objects in memory). If you see an axis like that, then you would expect `begin()` and `end()` to provide iterators over the bins. The axis does not know anything about the histogram it belongs to, it is a sub-structure without a callback into the super-structure. Therefore, it technically cannot even provide an iterator over the histogram values in the current design. This hierarchical separation is a nice feature.

I would argue that iterators should not be named for documentation purposes, there are algorithms which expect the standard names const_iterator and iterator, and most of the time you would write `auto` nowadays, so that the typename does not appear in the source code and thus you loose it as a source of self-documentation. The case against `bin_begin()` and `bin_end()` is based on the lost compatibility with range-based for loops, which expect the standard names `begin()` and `end()`.

I think it is better to emphasise the philosophy behind the design in the documentation to clear up these misunderstandings, no?

I am also thinking of changing the bin iterator to make it less surprising, by reusing the design of the value_iterator above. Example code:

auto a = /* make axis */
for (auto iter = a.begin(); iter != a.end(); ++iter) {
   auto index = iter.idx(); // special method to access current bin index
   auto bin = *iter; // the actual bin object, implemented as an interval with methods lower() and upper() for most axis types
}

The range-based for loop would then only iterate over the bins, not providing indices

for (auto bin : a) {
        // do something with bin
}

Since range-based for loops are so nice, I am thinking of providing an adaptor called "enumerate", inspired by Python

for (auto index_and_bin : enumerate(a)) {
   auto index = index_and_bin.first;
   auto bin = index_and_bin.second;
   // do something with index and bin
}

I think this design is less surprising then the current one.

> The iterator (and reverse_iterator) types can then be used to traverse
> the content.

I put rbegin() and rend() for axis types on the to-do list.

>> - What should this do if h is 2d? The same? If yes, in which order should the 2d array be iterated over?
>
> I would let axis iterators traverse only along the axis. If we need
> something to traverse all elements within a multi-dimensional array,
> then we probably need something else, such as mdspan [1].
>
> [1] https://github.com/kokkos/array_ref/blob/master/proposals/P0009.rst

Yeah, mdspan would fit the needs of this library better than multi_array, because the rank can be a dynamic property, while the rank in multi_array is a compile-time constant. Since mdspan is not ready yet, we should still try to provide at least something to iterate over the values, hence my proposal of a value_iterator above.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost