[histogram] should some_axis::size() return unsigned or int?

classic Classic list List threaded Threaded
25 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
Dear community (especially the reviewers of boost.histogram),

I am still working on finishing the library based on the excellent reviews I got in September. I cannot make up my mind about a specific detail of the library interface and would like to ask for your input.

Every axis type is required to have a size() method, which returns the number of bins along that axis (without counting extra bins for under- and overflow). Should this method return `unsigned` or `int`?

= The case for `unsigned` =
The size of an axis is non-negative, so the return type should reflect this.

All the STL containers return their size as an unsigned integer. Naturally, boost.histogram should be as consistent as possible with the STL, so you don't have think about exceptions from the general rule.

Whether `size()` returns a signed or unsigned type matters even in times of `auto`, since compilers like to warn about comparisons between signed and unsigned integers. These warnings and the STL have slowly thought people to write indexed loops with unsigned integers, like so:
```
auto h = make_histogram(…); // 1D histogram

for (unsigned i = 0; i < axis.size(); ++i) {
  auto x = h[i];
  // do something with bin
}
```
We don't want to break this painfully learned habit by letting `size()` return a signed integer, because it would generate a warning in this naive loop.

= The case for `int` =
The type of the bin index returned by an axis is `int`, because it can be negative. An axis is required to return -1 when the axis represents an ordered range and the value fall below the low edge of the first bin. So the index must be signed. The size of an axis is highest index + 1. It would be natural to use the same type for `size()` and for the returned index.

Internally in the library, there are many less-than comparisons between the index and the size of the axis. If the type of the latter is unsigned, the size must be cast to `int` in all these comparisons to avoid compiler warnings, which is adding code clutter in the implementation.

The library provides a range adaptor, called `indexed`, which allows one to iterate over the bins of a histogram with a special proxy, that can return the current bin index and the accumulated value for that bin. The following naive code to ignore under/overflow bins while iterating will generate a warning when `size()` returns `unsigned`
```
auto h = make_histogram(…); // 1D histogram

// iterate over indexed bin range, ignore indices < 0 and indices > size (= underflow/overflow bins)
for (auto x : indexed(h)) {
   if (x[0] < 0 || x[0] >= h.axis().size()) // without warning only when x[0] and h.axis().size() both return `int`
     continue;
  // do something with bin
}
```

People are supposed to use the indexed range to iterate over the bins and not a combination of simple loops like in the first listing, because it scales nicely to multi-dimensional histograms. In multiple dimensions, it is also potentially more efficient. The `indexed` range adaptor automatically moves sequentially in memory, which this is not guaranteed if the index is created by looping manually over several indices, like so
```
for (unsigned i = 0; i < h.axis(0).size(); ++i)
   for (unsigned j = 0; j < h.axis(1).size(); ++j)
{
   auto x = h.at(i, j); // this may jump around in memory, causing cache misses
}
```
How the linear index is generated from the multi-dimensional index is an implementation detail (although the implementation should probably anticipate these naive loops…)

So, if `indexed` is anyway the default recommended way of iterating over a histogram, then size() should return `int` to be compatible with the index type, which is also `int`, because it makes naive code work better.

Sorry for the long email, I would really appreciate your input.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
On Thu, Nov 29, 2018 at 12:41 PM Hans Dembinski via Boost
<[hidden email]> wrote:
> Every axis type is required to have a size() method, which returns the number of bins along that axis (without counting extra bins for under- and overflow). Should this method return `unsigned` or `int`?

size() returns unsigned
ssize() returns signed ;)



--
Olaf

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
> -----Original Message-----
> From: Boost [mailto:[hidden email]] On Behalf Of Hans Dembinski via Boost
> Sent: 29 November 2018 11:10
> To: Boost Devs
> Cc: Hans Dembinski; Jim Pivarski; Henry Schreiner
> Subject: [boost] [histogram] should some_axis::size() return unsigned or int?
>
> Dear community (especially the reviewers of boost.histogram),
>
> I am still working on finishing the library based on the excellent reviews I got in September. I cannot make up my mind about a
> specific detail of the library interface and would like to ask for your input.
>
> Every axis type is required to have a size() method, which returns the number of bins along that axis (without counting extra bins
> for under- and overflow). Should this method return `unsigned` or `int`?

<big snip>

you need a new name for signed axis size?

int axis_size()

perhaps ;-)

Paul

---
Paul A. Bristow
Prizet Farmhouse
Kendal UK LA8 8AB
+44 (0) 1539 561830





_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
Dear Paul, dear Olaf,

> On 29. Nov 2018, at 14:13, Paul A. Bristow via Boost <[hidden email]> wrote:
>
> you need a new name for signed axis size?
>
> int axis_size()
>
> perhaps ;-)

both of you basically propose to have two similarly named "size" methods which return `unsigned` and `int` respectively. I really hope that you are kidding as indicated by the smileys in your messages.

What I really need is a custom integer type for the index which is signed but can be compared safely with both `int` and `unsigned`. I could create a new integer type with a struct or an enum class and then implement all the operators explicitly. It is a lot of work, though.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Dear Jim,

> On 29. Nov 2018, at 13:31, Jim Pivarski <[hidden email]> wrote:
>
> It seems to me that if other C++ containers use unsigned integers for size(), you should too, for minimal surprise. The issue you raise about the "which bin?" function returning -1 would be solved in a very-strongly typed language like Haskell as an optional<int>. I know that C++ has optional types now, but I don't know how widely they're used or if there's a significant performance penalty. If this wouldn't look too weird in a C++ program and wouldn't slow it down (or needlessly complicate the code), an optional type would describe your intent more fully than -1.

there is boost::optional, which has the semantics of a pointer and can be used to represent a type that stores a value or not.

```
boost::optional o = some_fickle_function();
if (o) { // optional has a value?
  auto value = *o; // "dereference" to get the value
} else {
  // handle case where value is missing
}
```

This is not a good match here, because -1 here does not have the meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis.

Value arrow:
-inf ——————— | ——————— | —————— | —————— |—————————> +inf
      bin -1    bin 0     bin 1    bin 2    bin 3

I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
On 30/11/2018 06:12, Hans Dembinski wrote:
> This is not a good match here, because -1 here does not have the meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis.
>
> Value arrow:
> -inf ——————— | ——————— | —————— | —————— |—————————> +inf
>        bin -1    bin 0     bin 1    bin 2    bin 3
>
> I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.

Conventionally your size_type should be the same type returned by size()
and used for indexing.  So I would expect that type to be int, given the
above.

Having said that, you're already departing from standard container
conventions by having size() return a number that is *sometimes* 2
smaller than the "real" number of bins, which might frustrate generic
algorithms.

Completely without tongue in cheek, I wonder if it might be better to
not provide a size() method at all (to avoid container implications
which you do not satisfy) and spell it as bin_count() or something
distinct instead.


Another possibility that might sidestep all of this could be to remove
the underflow/overflow bins from the standard indexing and make them
only accessible through separate underflow_bin() methods (or similar)
instead.  But this might complicate other cases such as a
find_bin_for_value() that needs to return an underflow/overflow (though
that could be solved by returning a bin iterator, not an index).

I'm not sure if that's actually *better* though. :)

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
On Thu, Nov 29, 2018 at 3:38 PM Gavin Lambert via Boost <
[hidden email]> wrote:
>
> On 30/11/2018 06:12, Hans Dembinski wrote:
> > This is not a good match here, because -1 here does not have the
meaning of "value is missing", but it really is the logical index for the
virtual bin that spans from -infinity to the lower edge of the first bin in
the axis.
> >
> > Value arrow:
> > -inf ——————— | ——————— | —————— | —————— |—————————> +inf
> >        bin -1    bin 0     bin 1    bin 2    bin 3
> >
> > I think representing the underflow bin with -1 and the overflow bin
with the value returned by size() is very intuitive and elegant.

>
> Conventionally your size_type should be the same type returned by size()
> and used for indexing.  So I would expect that type to be int, given the
> above.
>
> Having said that, you're already departing from standard container
> conventions by having size() return a number that is *sometimes* 2
> smaller than the "real" number of bins, which might frustrate generic
> algorithms.
>
> Completely without tongue in cheek, I wonder if it might be better to
> not provide a size() method at all (to avoid container implications
> which you do not satisfy) and spell it as bin_count() or something
> distinct instead.

+1

If the operation differs semantically from .size() in standard containers,
then it shouldn't be called size().

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Thu, Nov 29, 2018 at 5:58 PM Hans Dembinski via Boost
<[hidden email]> wrote:

>
> Dear Paul, dear Olaf,
>
> > On 29. Nov 2018, at 14:13, Paul A. Bristow via Boost <[hidden email]> wrote:
> >
> > you need a new name for signed axis size?
> >
> > int axis_size()
> >
> > perhaps ;-)
>
> both of you basically propose to have two similarly named "size" methods which return `unsigned` and `int` respectively. I really hope that you are kidding as indicated by the smileys in your messages.

Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html


--
Olaf

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list

> Conventionally your size_type should be the same type returned by
> size() and used for indexing.  So I would expect that type to be int,
> given the above.
>
> Having said that, you're already departing from standard container
> conventions by having size() return a number that is *sometimes* 2
> smaller than the "real" number of bins, which might frustrate generic
> algorithms.
>
> Completely without tongue in cheek, I wonder if it might be better to
> not provide a size() method at all (to avoid container implications
> which you do not satisfy) and spell it as bin_count() or something
> distinct instead.
>
>
> Another possibility that might sidestep all of this could be to remove
> the underflow/overflow bins from the standard indexing and make them
> only accessible through separate underflow_bin() methods (or similar)
> instead.  But this might complicate other cases such as a
> find_bin_for_value() that needs to return an underflow/overflow
> (though that could be solved by returning a bin iterator, not an index).
A high +1:

Having size() makes on expect that "for(i=0;i<size();i++) h[i]" iterates
over everything which it seems it does not. Hence it should not be
called `size()`

 > I think representing the underflow bin with -1 and the overflow bin
with the value returned by size() is very intuitive and elegant.

IMO this is against expectation of every C++ programmer: Being able to
index a container with -1 and size() so in every review of code using
this it will come up at least as a raised eyebrow.
Other idea: If those bins are so special that they don't fit into the
[0, size()) range, why not use a different function for getting them,
which is not the index operator? high_bin()/low_bin() come to mind.

But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which
starts at -inf and size()-1 to be the last one spanning to inf? This
would allow a histogram of size 1 which has a single bin holding all
values. This makes me wonder to if you *always* want the +-inf bins or
if there are reasonable use cases NOT wanting them and expect
out-of-range values to be dropped.
Disclaimer: I'm not familiar with Boost.histogram and see this from a
users perspective only.




_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

smime.p7s (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list

> On 30. Nov 2018, at 08:43, Olaf van der Spek <[hidden email]> wrote:
>
> On Thu, Nov 29, 2018 at 5:58 PM Hans Dembinski via Boost
> <[hidden email]> wrote:
>>
>> Dear Paul, dear Olaf,
>>
>>> On 29. Nov 2018, at 14:13, Paul A. Bristow via Boost <[hidden email]> wrote:
>>>
>>> you need a new name for signed axis size?
>>>
>>> int axis_size()
>>>
>>> perhaps ;-)
>>
>> both of you basically propose to have two similarly named "size" methods which return `unsigned` and `int` respectively. I really hope that you are kidding as indicated by the smileys in your messages.
>
> Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html

Ok, I see, thank you for the link. I am not sure if the proposal is the best solution for the show-case problem outlined in the proposal. My personal bias against having to similar "size" methods is high, since I feel that the high level interface now exposes a low-level problem, namely, how basic operators should work on signed and unsigned types.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list

> On 30. Nov 2018, at 00:38, Gavin Lambert via Boost <[hidden email]> wrote:
>
> On 30/11/2018 06:12, Hans Dembinski wrote:
>> This is not a good match here, because -1 here does not have the meaning of "value is missing", but it really is the logical index for the virtual bin that spans from -infinity to the lower edge of the first bin in the axis.
>> Value arrow:
>> -inf ——————— | ——————— | —————— | —————— |—————————> +inf
>>       bin -1    bin 0     bin 1    bin 2    bin 3
>> I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.
>
> Conventionally your size_type should be the same type returned by size() and used for indexing.  So I would expect that type to be int, given the above.

Makes sense.

> Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms.

Generic algorithm operate on iterator intervals, so that is a separate matter, I think.

I think there is no perfect solution which is always unsurprising. In some calculations, you want to loop over the extra bins (underflow/overflow), but in most calculations you don't care about them and you want to ignore them. The current design is build on this insight, which is drawn from experience collected with CERN's ROOT library, where the underflow bin is placed at index 0. In that library, the first bin you are usually interested in starts at index 1. This turned out to be much more confusing and surprising to people, especially beginners.

I originally had two methods, `size()` and `shape()`. The former would return the logical size of the axis, the latter the physical size. People didn't like `shape()` so we discussed other options and settled for `extend()`. `extend()` is now a free function in the axis::traits namespace and not a method anymore. The axis only has a `size()` method. In practice, the end user of the library has little use for `extend()`, it is mainly used internally.

> Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead.

I am tempted by your argument, but not convinced whether this will really make the library less surprising. An axis may not be a container, but `for (int i = 0; i < axis.size(); ++i)` is what you would expect to work since it is similar enough to a container. We use these kinds of analogies everywhere, e.g. boost::optional uses pointer semantics, although it is not a pointer. Or iterators in general, they are not pointers and most of them do not provide the full semantics of a real pointer.

> Another possibility that might sidestep all of this could be to remove the underflow/overflow bins from the standard indexing and make them only accessible through separate underflow_bin() methods (or similar) instead.  But this might complicate other cases such as a find_bin_for_value() that needs to return an underflow/overflow (though that could be solved by returning a bin iterator, not an index).

Any other solution is worse, I hope the following argument shows this. Remember the value arrow from my other email. There is no other convention for the underflow and overflow indices that is as intuitive as this convention. It also makes looping over all bins, including the extra bins every easy:

for (int i = 0; i < axis.size(); ++i) // loop over internal bins, what you would naively do
for (int i = -1; i <= axis.size(); ++i) // loop over internal and extra bins

In any other convention, I would have to loop over the internal bins and then handle the extra bins outside of the loop.

for (int i = 0; i < axis.size(); ++i)
   do_something_with_index(i);
do_something_with_index(axis.underflow_bin());
do_something_with_index(axis.overflow_bin());

This is clearly less elegant in code.

Best regards,
Hans


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list

> On 30. Nov 2018, at 08:51, Alexander Grund via Boost <[hidden email]> wrote:
>
>
>> Conventionally your size_type should be the same type returned by size() and used for indexing.  So I would expect that type to be int, given the above.
>>
>> Having said that, you're already departing from standard container conventions by having size() return a number that is *sometimes* 2 smaller than the "real" number of bins, which might frustrate generic algorithms.
>>
>> Completely without tongue in cheek, I wonder if it might be better to not provide a size() method at all (to avoid container implications which you do not satisfy) and spell it as bin_count() or something distinct instead.
>>
>>
>> Another possibility that might sidestep all of this could be to remove the underflow/overflow bins from the standard indexing and make them only accessible through separate underflow_bin() methods (or similar) instead.  But this might complicate other cases such as a find_bin_for_value() that needs to return an underflow/overflow (though that could be solved by returning a bin iterator, not an index).
>
> A high +1:
>
> Having size() makes on expect that "for(i=0;i<size();i++) h[i]" iterates over everything which it seems it does not. Hence it should not be called `size()`

You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence.

In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested.

> > I think representing the underflow bin with -1 and the overflow bin with the value returned by size() is very intuitive and elegant.
>
> IMO this is against expectation of every C++ programmer: Being able to index a container with -1 and size() so in every review of code using this it will come up at least as a raised eyebrow.

An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense.

> Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind.

See explanation to Gavin why this is worse.

> But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values.

And why would you want such an axis? It would be pointless and make the histogram operate slower. Nevertheless, you can easily generate such an axis as user-defined axis type

// this axis is valid and can be plugged into boost.histogram
struct my_axis {
  int operator()(double x) const { return 0; } // always return 0
  unsigned size() const { return 1; } // axis has size of 1
};

> This makes me wonder to if you *always* want the +-inf bins or if there are reasonable use cases NOT wanting them and expect out-of-range values to be dropped.

Sure, there are reasonable use cases where you don't want them. It is a compile-time option for the builtin axis types (on by default). Also, user-defined axis types are not required to have them (see example above).

> Disclaimer: I'm not familiar with Boost.histogram and see this from a users perspective only.

No problem, your user perspective is appreciated. It would be good to read the rationale, in the docs of boost.histogram though, since it lists the arguments that have been refined in previous discussions already.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list

Am 30.11.18 um 11:36 schrieb Hans Dembinski:
> You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence.
>
> In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested.
Sounds reasonable: A range excluding the over/underflow bins and one
including it.
> An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense.
Your code example was the following:

for (unsigned i = 0; i < axis.size(); ++i) {
   auto x = h[i];
   // do something with bin
}

So it looks like a container, although size and []-operator are in
different instances (which feels weird, but ok)
>> Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind.
> See explanation to Gavin why this is worse.
Combining this with "Users usually ignore them[...] the `indexed` range
adaptor should probably skip them by default" I do see the need for
extra functions here too. Your argument against "high_bin()/low_bin()"
was: Iteration must be split. But your above comment already suggests,
that there are iterators which can cover the whole range. Could they
solve this split-iteration-problem?
>> But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values.
> And why would you want such an axis? It would be pointless and make the histogram operate slower.

I was not saying this should be done. It would just be consistent. There
are 2 dimensions:
- open ranged bins yes/no
- number of bins
In my mind enabling open ranged bins does not ADD bins but makes the
first and last go to +-inf:

axis(4,0,10,"",uoflow_type::on) -> [-inf,0), [0,5), [5,10), [10, inf]
axis(4,0,10,"",uoflow_type::off) -> [0,2.5), [2.5,5), [5,7.5), [7.5,10)

Of course this might be confusing so default should be "off" as "users
usually ignore them" so they are advanced things one does not generally
need, right?
(Side note: The parameter description at
https://hdembinski.github.io/histogram/doc/html/boost/histogram/axis/regular.html 
is confusing due to the list order not matching the parameter order.)

So my TLDR of this is: Consistency and meeting expectations. If it
breaks either, think again about the choices made.

For this it is either:

- *-flow bins are kinda regular bins -> included in size(), iteration,
same behavior like regulars
- *-flow are special bins -> not included in size(), special accessors
and iterators with default ones not including them.
     Given that: Why not have special constants for Underflow AND
Overflow bin (e.g. -1 and -2) (instead of -1 and size(), where the
latter is a runtime constant), then you could have a `int
find_including_ouflow` and a `unsigned find` as well as `get(unsigned)`,
`get_with_uoflow(int)` -> Idea is to make the special handling obvious

Alex

PS: I don't want to push anything. Just my thoughts on your issue in the
hope it helps you finding a solution which you are happy with.



_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

smime.p7s (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list


> On 30. Nov 2018, at 12:58, Alexander Grund via Boost <[hidden email]> wrote:
>
>
> Am 30.11.18 um 11:36 schrieb Hans Dembinski:
>> You are overestimating the importance of the *-flow bins, I think. Users usually ignore them when they analyse their histograms. They must be there for other reasons which are explained in the rationale and they are very useful for expert-level statistical analyses and for debugging. The beginner, however, should not notice their presence.
>>
>> In fact, the `indexed` range adaptor should probably skip them by default, and only iterate over them when that is explicitly requested.
> Sounds reasonable: A range excluding the over/underflow bins and one including it.
>> An axis is not a container. It does not hold values and it has no operator[], precisely to emphasise this difference. It has size() though. See my email to Gavin with a long explanation why I think that makes sense.
> Your code example was the following:
>
> for (unsigned i = 0; i < axis.size(); ++i) {
>  auto x = h[i];
>  // do something with bin
> }

`i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really  includes all bins.

A histogram is technically a multi-dimensional array, but semantically it is more than that. In a histogram, each dimension of the multi-dimensional array also has an associated arrow of values. The axis types logically represent these arrows, which have indices and values. The axis itself is not very much like a container and I don't expect users to run STL algorithms on axis objects.

> So it looks like a container, although size and []-operator are in different instances (which feels weird, but ok)

I hope it is more clear now.

>>> Other idea: If those bins are so special that they don't fit into the [0, size()) range, why not use a different function for getting them, which is not the index operator? high_bin()/low_bin() come to mind.
>> See explanation to Gavin why this is worse.
> Combining this with "Users usually ignore them[...] the `indexed` range adaptor should probably skip them by default" I do see the need for extra functions here too. Your argument against "high_bin()/low_bin()" was: Iteration must be split. But your above comment already suggests, that there are iterators which can cover the whole range. Could they solve this split-iteration-problem?

Yes, but we are diverging from my original question now. Users are recommended to use the range adaptors and iterators provided by the library. For the adaptors and iterators, all cases can be gracefully implemented.

But I know my potential users very well, they will also use integers as indices to loop over the histogram, because people in target community are often beginner programmers and using an integer index feels natural to them. The question is how to optimise the design for this use case and so that it does not clash with similar (mis)usage of STL containers.

>>> But WHY was this chosen? Wouldn't it be ok if 0 is the first bin which starts at -inf and size()-1 to be the last one spanning to inf? This would allow a histogram of size 1 which has a single bin holding all values.
>> And why would you want such an axis? It would be pointless and make the histogram operate slower.
>
> I was not saying this should be done. It would just be consistent. There are 2 dimensions:
> - open ranged bins yes/no
> - number of bins
> In my mind enabling open ranged bins does not ADD bins but makes the first and last go to +-inf:
>
> axis(4,0,10,"",uoflow_type::on) -> [-inf,0), [0,5), [5,10), [10, inf]
> axis(4,0,10,"",uoflow_type::off) -> [0,2.5), [2.5,5), [5,7.5), [7.5,10)

No, not a good idea. If we do that, then toggling *-flow bins on/off changes your whole program as it is written up to now, it will do something completely different.

A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change logic of the program you wrote so far.

> Of course this might be confusing so default should be "off" as "users usually ignore them" so they are advanced things one does not generally need, right?

I can see that you did not read the rationale carefully yet.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list

Am 30.11.18 um 13:27 schrieb Hans Dembinski:
> I can see that you did not read the rationale carefully yet.
I just did not understand everything ;)
> `i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really  includes all bins.
Ah ok. So the essence of your question was: What should the types be for
histogram::size() and histogram::operator[]?
BTW: Where is an interface description of that histogram class? In the
reference I only find a forward declaration and "unspecified type", so I
don't know what members it has.

In this case I go with the "same-as-stdlib" approach:

- Types should be the same
- for(auto i = SizeType(0); i<h.size(); i++) h.at(i).doIt() should be
possible and iterate over all bins

The 2nd point implies the *flow-bins should be at axis.size() and
axis.size()+1

> Yes, but we are diverging from my original question now. Users are recommended to use the range adaptors and iterators provided by the library. For the adaptors and iterators, all cases can be gracefully implemented.
>
> But I know my potential users very well, they will also use integers as indices to loop over the histogram, because people in target community are often beginner programmers and using an integer index feels natural to them. The question is how to optimise the design for this use case and so that it does not clash with similar (mis)usage of STL containers.
See above. Especially for beginners it has to be consistent with other
usage. for(int i=-1; i<=h.size(); i++) WILL confuse them.
If you are referring to conversion warnings: Same as with stdlib.
> No, not a good idea. If we do that, then toggling *-flow bins on/off changes your whole program as it is written up to now, it will do something completely different.
>
> A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change logic of the program you wrote so far.

True, ok thanks for the clarification. Then axis.size is what the user
says and h.size()==axis.size||axis.size+2. Putting the *flow bins at the
end allows iterating over axis.size OR h.size with the same code. You
can even go from for(int i=0; i<axis.size(); i++) to iterating till
h.size if you chose to include the extra bins in your iteration.


Regards, Alex




_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

smime.p7s (6K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list


> On 30. Nov 2018, at 14:44, Alexander Grund via Boost <[hidden email]> wrote:
>
>
> Am 30.11.18 um 13:27 schrieb Hans Dembinski:
>> I can see that you did not read the rationale carefully yet.
> I just did not understand everything ;)

Fair enough, but then let me know what you don't understand and I will try to explain it better.

>> `i` is the index for the histogram here (!) not for the axis itself. The histogram is also not a container, but it acts more like one. It also has a size() method and the value really  includes all bins.
> Ah ok. So the essence of your question was: What should the types be for histogram::size() and histogram::operator[]?

No, the question is about axis::size(), whether it should return `int` or `unsigned`. There is no problem with histogram::size() or histogram::operator[].

> BTW: Where is an interface description of that histogram class? In the reference I only find a forward declaration and "unspecified type", so I don't know what members it has.

There is a technical problem in the reference generation that I haven't been able to fix yet. Help from experts would be very much appreciated. In the code, the histogram interface is documented and the documentation is supposed to be part of the reference.

> In this case I go with the "same-as-stdlib" approach:
>
> - Types should be the same
> - for(auto i = SizeType(0); i<h.size(); i++) h.at(i).doIt() should be possible and iterate over all bins

Agreed.

>> No, not a good idea. If we do that, then toggling *-flow bins on/off changes your whole program as it is written up to now, it will do something completely different.
>>
>> A user should be able to code the analysis and then decide: "Ah crap, these extra bins cost too much memory and in my special case they are always empty anyway, because my values never go out of the range that I specified. So let's just turn them off". Doing that optimization should not change logic of the program you wrote so far.
>
> True, ok thanks for the clarification. Then axis.size is what the user says and h.size()==axis.size||axis.size+2. Putting the *flow bins at the end allows iterating over axis.size OR h.size with the same code. You can even go from for(int i=0; i<axis.size(); i++) to iterating till h.size if you chose to include the extra bins in your iteration.

This implies strange behaviour for the axis. Note that you can and sometimes want to use the axis manually to maps values to indices. Then you get this, if I follow your suggestion:

auto a = axis::regular<>(2, 1, 2); // axis with two bins [1, 1.5), [1.5, 2.0)

assert(a(0.99) == 3); // ??
assert(a(1) == 0); // ok
assert(a(1.5) == 1); // ok
assert(a(2.0) == 2); // ok
assert(a(10.0) == 2); // ok

The ordering of the indices is not the same as the ordering of the values.

I find that unacceptable.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 11/30/18 11:16 AM, Hans Dembinski via Boost wrote:

> In any other convention, I would have to loop over the internal bins and then handle the extra bins outside of the loop.

I find this to be the cleanest solution, as the under-/overflow bins are
special anyways.

If we wish to iterator over all bins, then you could provide a special
iterator for that. Hopefully it would be possible reuse the existing
iterator and simply provide new begin/end functions.

If we wish to index into the under-/overflow bins, then you could
provide an operator[] that overloads on enum types for the underflow
and overflow bins (as suggested in my formal review.)

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 30/11/2018 20:43, Olaf van der Spek wrote:
> Not really: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html

Isn’t that proposed definition of ssize() essentially identical to

     // member function
     constexpr difference_type ssize() const
     { return std::distance(begin(), end()); }
     // or moral equivalent, for containers that can implement it
     // more efficiently another way.

     // non-member function in std::
     template <class C>
     constexpr auto ssize(const C& c)
     { return distance(begin(c), end(c)); }
     // or calling member ssize() if it exists

The above seems "better" in that it allows a container intended
exclusively for small sizes to use a difference_type smaller than ptrdiff_t.

(In the text it explains why it's a bad idea for a container using
uint16_t sizes with more than 32767 elements to use int16_t as a
difference_type, but as I understand it that's already undefined
behaviour -- size_type is required to be larger than difference_type and
difference_type should be able to express std::distance(begin(), end())
or some algorithms will break.)

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
[ Gaving cc'd me as the author of
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html ]

"distance(begin, end)" isn't O(1) for many types of containers, which is
why P1227 defines ssize() as the static_casted result of size().

Of course, for many containers, size() is implemented as the subtraction of
two pointers, static_casted to size_t.  (See
https://github.com/llvm-mirror/libcxx/blob/master/include/vectorfor
example)  In those cases, it would make more sense to define ssize() as
simply the subtraction of those two pointers.

"difference_type should be able to express std::distance(begin(), end()) or
some algorithms will break"

A similar point came up at the San Diego standards meeting: If you have a
container large enough that it might take up more than half the address
space, ptrdiff_t will not be able to express the difference between begin()
and end().  The conclusion drawn (not necessarily shared by everyone in the
room) was that containers that take up more than half of the address space
are not really supported by C++.  Or at least, not if their iterators have
byte-level granularity.

""better" in that it allows a container intended exclusively for small
sizes to use a difference_type smaller than ptrdiff_t."

I applaud your use of quotation marks around better, since such a
difference_type is often inefficient in that the caller is often forced to
immediately promote it to larger size, before the processor can work with
it.

= - = - = - = - =

Before going down the road of size() and ssize(), the real question is what
should this do:

for (const auto &el: some_axis) {
  // Iterates over the underflow/overflow bins?
}

for (size_t i = 0; i != some_axis.size(); ++i) {
  auto &el = some_axis[i];
  // Iterates over the underflow/overflow bins?
}

It looks like you wanted to use an index of -1 and size() to refer to
underflow and overflow.  That's... interesting... and I mean "interesting"
in a good sense of the word.

What this means is that someone who wants to iterate the *-flow bins might
write:

for (int i = -1; i != some_axis.size() + 1; ++i) {
  auto &el = some_axis[i];
  // Iterates over all the bins, including *-flow.
}

And the code would work.  However someone else might naively write that
first line differently:

for (int i = -1; i < some_axis.size() + 1; ++i) {

and that loop would never iterate any entries at all, if size() returned an
unsigned value.  This code, however, would work:

for (int i = -1; (i == -1) || (i < some_axis.size() + 1); ++i) {

... it just gets really ugly, really fast...

with an ssize() routine, this does work:

for (int i = -1; i < some_axis.ssize() + 1; ++i) {

But this does not:

for (size_t i = -1; i < some_axis.ssize() + 1; ++i) {

= - = - =

If it were me writing the code purely for myself, I would avoid unsigned
integers except for bit manipulation and intended overflow situations.  And
this would all work fine.  Of course that still leaves the question of
whether you want iteration from begin() to end() to include the
overflow/underflow bins.

Since you're writing this code for others, I'd be wary of breaking user's
ingrained size_t / less-than habits.

I'm curious about whether you've considered using 0 and size() - 1 as the
indices for the special bins?  In that case, users would write this to
iterate over all bins:

for (const auto &el: some_axis) {
for (size_t i = 0; i < some_axis.size(); ++i) {

And something like this to iterate over just the non-special bins:

for (size_t i = 1; i <= some_axis.num_bins(); ++i) {

This would support both my "signed everywhere" contingent, as well as the
"size_t everywhere, be sure it never goes negative" contingent.

-- Jorg

On Sun, Dec 2, 2018 at 4:15 PM Gavin Lambert <[hidden email]> wrote:

> On 30/11/2018 20:43, Olaf van der Spek wrote:
> > Not really:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1227r0.html
>
> Isn’t that proposed definition of ssize() essentially identical to
>
>      // member function
>      constexpr difference_type ssize() const
>      { return std::distance(begin(), end()); }
>      // or moral equivalent, for containers that can implement it
>      // more efficiently another way.
>
>      // non-member function in std::
>      template <class C>
>      constexpr auto ssize(const C& c)
>      { return distance(begin(c), end(c)); }
>      // or calling member ssize() if it exists
>
> The above seems "better" in that it allows a container intended
> exclusively for small sizes to use a difference_type smaller than
> ptrdiff_t.
>
> (In the text it explains why it's a bad idea for a container using
> uint16_t sizes with more than 32767 elements to use int16_t as a
> difference_type, but as I understand it that's already undefined
> behaviour -- size_type is required to be larger than difference_type and
> difference_type should be able to express std::distance(begin(), end())
> or some algorithms will break.)
>

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [histogram] should some_axis::size() return unsigned or int?

Boost - Dev mailing list
On 3/12/2018 15:15, Jorg Brown wrote:
> "distance(begin, end)" isn't O(1) for many types of containers, which is
> why P1227 defines ssize() as the static_casted result of size().
>
> Of course, for many containers, size() is implemented as the subtraction of
> two pointers, static_casted to size_t.  (See
> https://github.com/llvm-mirror/libcxx/blob/master/include/vector for
> example)  In those cases, it would make more sense to define ssize() as
> simply the subtraction of those two pointers.

Which is why I added the "moral equivalent" clause. :)

> ""better" in that it allows a container intended exclusively for small
> sizes to use a difference_type smaller than ptrdiff_t."
>
> I applaud your use of quotation marks around better, since such a
> difference_type is often inefficient in that the caller is often forced to
> immediately promote it to larger size, before the processor can work with
> it.

True.  But it also in theory allows for more esoteric indexing, such as
indexing by an enum or a complex value or some other UDT.  (Technically
that doesn't meet the requirements of Container, but provided the type
behaves sufficiently like an integer then most algorithms would still work.)

Whether that's "better" or not might be another argument. :)

And AFAIK all the standard container types use ptrdiff_t as their
difference_type anyway, so it seems reasonable to use that instead of
mandating ptrdiff_t.

> I'm curious about whether you've considered using 0 and size() - 1 as the
> indices for the special bins?

IIRC, Hans stated that another library does this and it makes it more
awkward to use, since (he says) mostly you ignore the extra bins and
turning them on or off changes the indexing of the regular bins.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
12