[histogram] Variance

Boost - Dev mailing list
The variance of individual bins can be obtained when using the
adaptive_storage (via h.at(i).variance()).

I am trying to understand the overhead of this feature.

If I interpret the code correctly, there is a space overhead because
each counter has to keep track of both the count and the sum of squares.
The computational overhead is that the sum of squares has to be
calculated for each insertion. Is this correct?

If so, is there any way to use the adaptive storage policy without
variance?

Furthermore, why does variance() return the sum of squares? Should this
not be divided by the sample size?

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Re: [histogram] Variance

Boost - Dev mailing list
AMDG

On 09/17/2018 02:08 PM, Bjorn Reese via Boost wrote:

> The variance of individual bins can be obtained when using the
> adaptive_storage (via h.at(i).variance().)
>
> I am trying to understand the overhead of this feature.
>
> If I interpret the code correctly, there is a space overhead because
> each counter has to keep track of both the count and the sum of squares.
> The computational overhead is that the sum of squares has to be
> calculated for each insertion. Is this correct?
>

It's only tracked if you use weights.

> If so, is there any way to use the adaptive storage policy without
> variance?
>
> Furthermore, why does variance() return the sum of squares? Should this
> not be divided by the sample size?
>

You're thinking of the formula
variance = \sum (x_i - mean)^2 / count = \sum x_i^2/count - mean^2
That formula doesn't apply in this case, since the variance
is the variance of the bin count, not the variance of the
weights.  The estimate for the variance is described here:
http://hdembinski.github.io/histogram/doc/html/histogram/rationale.html#histogram.rationale.variance
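
To make the estimate concrete, here is a minimal sketch of a weighted bin counter. It assumes, as discussed in this thread, that the bin value is the sum of the weights and that the variance estimate is the sum of the squared weights. The name WeightedBin and its members are hypothetical and are not the library's own classes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical weighted bin counter (illustration only, not library code).
struct WeightedBin {
    double sum_w = 0.0;   // sum of weights: the (weighted) bin count
    double sum_w2 = 0.0;  // sum of squared weights: variance estimate

    // Add one entry with weight w (w = 1 for an unweighted fill).
    void fill(double w = 1.0) {
        assert(std::isfinite(w));
        sum_w += w;
        sum_w2 += w * w;
    }

    double value() const { return sum_w; }
    double variance() const { return sum_w2; }
};
```

With unit weights the two sums are equal, so the variance estimate is simply the count. That is the Poisson estimate for a bin count, which is why no division by the sample size appears.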

In Christ,
Steven Watanabe


Re: [histogram] Variance

Boost - Dev mailing list
Dear Bjørn,

> On 17. Sep 2018, at 22:08, Bjorn Reese via Boost <[hidden email]> wrote:
>
> The variance of individual bins can be obtained when using the
> adaptive_storage (via h.at(i).variance().)
>
> I am trying to understand the overhead of this feature.
>
> If I interpret the code correctly, there is a space overhead because each counter has to keep track of both the count and the sum of squares.
> The computational overhead is that the sum of squares has to be
> calculated for each insertion. Is this correct?
>
> If so, is there any way to use the adaptive storage policy without
> variance?

there is a minor overhead in the return value. Whenever you query the adaptive_storage, two doubles - one for the value and one for the variance -, which is slightly wasteful if you don't care about the variance, then you would need only one double. I don't know how smart compilers are in this case, the compiler may even remove the code that fills the second double when it is not used. In memory, the adaptive_storage uses only a single integer for each counter if you don't use weighted fills.

Returning two doubles even if one is sufficient is a minor overhead, but if this is bothering people I could add a compile-time option for the adaptive_storage class to turn all weight-handling off.

> Furthermore, why does variance() return the sum of squares? Should this
> not be divided by the sample size?

This was already answered by Steven (thanks!).

Kind regards,
Hans


Re: [histogram] Variance

Boost - Dev mailing list

> On 18. Sep 2018, at 09:41, Hans Dembinski <[hidden email]> wrote:
>
> there is a minor overhead in the return value. Whenever you query the adaptive_storage, two doubles - one for the value and one for the variance -, which is slightly wasteful if you don't care about the variance, then you would need only one double. I don't know how smart compilers are in this case, the compiler may even remove the code that fills the second double when it is not used. In memory, the adaptive_storage uses only a single integer for each counter if you don't use weighted fills.

Ah, sorry, I am not having a stroke, I am just in a hurry; I should have read it again before sending.

There is a minor overhead when the return value is created as you call ".at(…)". Whenever you query the adaptive_storage, two doubles are filled, one for the value and one for the variance, which is slightly wasteful if you don't care about the variance. In that case, you would need only one double. I don't know how smart compilers are in this case; the compiler may even remove the code that fills the second double if it is not used. In memory, the adaptive_storage uses only a single integer for each counter if you don't use weighted fills.
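
The behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual adaptive_storage implementation: the class AdaptiveBins, the struct WeightedCounter, and all member names are hypothetical. The sketch keeps one integer per bin until the first weighted fill, then switches to two doubles per bin. For unweighted counts it assumes the variance equals the count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical per-bin state once weights are in use.
struct WeightedCounter {
    double sum_w = 0.0;   // sum of weights
    double sum_w2 = 0.0;  // sum of squared weights
};

// Hypothetical storage: integers until the first weighted fill.
class AdaptiveBins {
    using IntBins = std::vector<std::uint64_t>;
    using WeightedBins = std::vector<WeightedCounter>;
    std::variant<IntBins, WeightedBins> bins_;

public:
    explicit AdaptiveBins(std::size_t n) : bins_(IntBins(n)) {}

    std::size_t size() const {
        return std::visit([](const auto& v) { return v.size(); }, bins_);
    }
    bool is_integer_mode() const {
        return std::holds_alternative<IntBins>(bins_);
    }

    // Unweighted fill: stays in the compact integer representation.
    void fill(std::size_t i) {
        assert(i < size());
        if (auto* ints = std::get_if<IntBins>(&bins_)) {
            ++(*ints)[i];
        } else {
            fill(i, 1.0);
        }
    }

    // Weighted fill: converts all bins to weighted counters on first use.
    void fill(std::size_t i, double w) {
        assert(i < size());
        if (auto* ints = std::get_if<IntBins>(&bins_)) {
            WeightedBins converted(ints->size());
            for (std::size_t k = 0; k < ints->size(); ++k) {
                const double c = static_cast<double>((*ints)[k]);
                converted[k] = WeightedCounter{c, c};  // variance = count
            }
            bins_ = std::move(converted);
        }
        auto& wc = std::get<WeightedBins>(bins_)[i];
        wc.sum_w += w;
        wc.sum_w2 += w * w;
    }

    // Query always produces two doubles: (value, variance).
    std::pair<double, double> at(std::size_t i) const {
        assert(i < size());
        if (const auto* ints = std::get_if<IntBins>(&bins_)) {
            const double c = static_cast<double>((*ints)[i]);
            return {c, c};
        }
        const auto& wc = std::get<WeightedBins>(bins_)[i];
        return {wc.sum_w, wc.sum_w2};
    }
};
```

In this sketch the memory cost of the variance is paid only after a weighted fill, while the query always returns a pair of doubles, which is the minor overhead mentioned above.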


Re: [histogram] Variance

Boost - Dev mailing list
On 09/17/18 22:33, Steven Watanabe via Boost wrote:

> You're thinking of the formula
> variance = \sum (x_i - mean)^2 / count = \sum x_i^2/count - mean^2

Indeed.

> That formula doesn't apply in this case, since the variance
> is the variance of the bin count, not the variance of the
> weights.  The estimate for the variance is described here:
> http://hdembinski.github.io/histogram/doc/html/histogram/rationale.html#histogram.rationale.variance

Ok, so weights are used to increase the bin count by a certain amount,
and the variance is an estimate of the spread of these weighted counts.

I had initially assumed that the per-bin variance measured how much
the values put into a bin deviate from its center, e.g. the
midpoint of the bin or the bin average.


Re: [histogram] Variance

Boost - Dev mailing list
Hi,

Sorry to get back to the issue of variance. I am unsure about the justification of choosing the variance based on the Poisson distribution instead of the binomial distribution.

My understanding is that the Poisson distribution describes the number of events over a continuous domain of opportunities (say, a period of time), whereas the binomial distribution describes the number of events over a discrete number of opportunities (say, coin flips).

Both seem appropriate in some use cases. However, the histogram class has no sense of the passage of time, whereas it does know the number of discrete opportunities (every time operator() is called). And the typical use of a histogram seems to be to distribute a given number of samples over the bins that they belong to.

So, would it not be more appropriate to estimate variance based on the binomial distribution?

I am not a statistician, so happy to be put right.

Alex
 




Re: [histogram] Variance

Boost - Dev mailing list
Hi,

> On 26. Sep 2018, at 15:16, a.hagen-zanker--- via Boost <[hidden email]> wrote:
>
> Sorry to get back to the issue of variance. I am unsure about the justification of choosing the variance based on the Poisson distribution instead of the binomial distribution.
>
> My understanding is that the Poisson distribution is based on a distribution of a number of event given a continuous domain of opportunities (say a period of time). Whereas a binomial distribution is for a number of event for a discrete number of opportunities (say coin flips).
>
> Both seem appropriate in some use cases. However, the histogram class has no sense of the passage of time, whereas it does know the number of discrete opportunities (every time operator () is called).  And the typical use of histogram seems to be to distribute a given number of samples over the bins that they belong to.
>
> So, would it not be more appropriate to estimate variance based on the binomial distribution?

The choice is between the Poisson distribution and the multinomial distribution, and it is a bit subtle.
https://en.wikipedia.org/wiki/Multinomial_distribution
Either can be correct, depending on the scenario.

Poisson is correct, for example, when you monitor for a while a random process that produces some value x at random points in time at a constant rate. You bin the outcomes, and then stop monitoring at an arbitrary point in time. This is the right way to model many physics experiments. It is also correct if you run a survey with a random number of participants, i.e. when you pass the survey to a large number of people without knowing beforehand how many are going to respond.

Multinomial is correct when there is a predefined, fixed number of events, each with a random exclusive outcome, and you bin those outcomes. The important point is that the number of events is fixed before the experiment is conducted. This is the main difference from the previous case, where the total number of events is not known beforehand. This would be correct if you run a survey with a fixed number of participants, whom you invite explicitly, and don't start the analysis before all have returned the survey.

If you have many bins in your histogram, the difference between the two becomes negligible. The variance for a multinomial count is n p (1 - p), where n is the total number of events and p is the probability to fall into this bin. The variance for a Poissonian count is n p, if you write it in the same way. If you have many bins, then p << 1 and n p (1 - p) = n p - n p^2, which is approximately n p.
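
The size of the difference can be checked numerically. The following is a small sketch using only the two formulas quoted above; the function names are illustrative and do not come from the library.

```cpp
#include <cassert>

// n: total number of fills; p: probability that a fill lands in this bin.
double poisson_variance(double n, double p) { return n * p; }
double multinomial_variance(double n, double p) { return n * p * (1.0 - p); }

// Fraction by which the Poisson variance exceeds the multinomial one.
// Algebraically this equals p, independent of n.
double relative_gap(double n, double p) {
    assert(n > 0.0 && p > 0.0 && p <= 1.0);
    const double poisson = poisson_variance(n, p);
    return (poisson - multinomial_variance(n, p)) / poisson;
}
```

Since the relative gap equals p, a histogram with 100 equally likely bins (p = 0.01) gives estimates that differ by about 1%, whereas with only two equally likely bins (p = 0.5) the Poisson estimate is twice the multinomial one.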

Best regards,
Hans


Re: [histogram] Variance

Boost - Dev mailing list
On Mon, 1 Oct 2018 at 15:04, Hans Dembinski via Boost <[hidden email]>
wrote:

> Poisson is correct, for example, when you monitor a random process for a
> while which produces some value x at random points in time with a constant
> rate. You bin the outcomes, and then stop monitoring at an arbitrary point
> in time. This is the right way to model many physics experiments. It is
> also correct if you make a survey with a random number of participants,
> i.e. when you pass the survey to a large number of people without knowing
> beforehand how many are going to respond.
>
> Multinomial is correct, when there is a predefined fixed number of events,
> each with a random exclusive outcome, and you bin those outcomes. The
> important point is that the number of events is fixed before the experiment
> is conducted. This is the main difference to the previous case, where the
> total of events is not known beforehand. This would be correct, if you make
> a survey with a fixed number of participants, which you invite explicitly
> and don't start the analysis before all have return the survey.
>

From what you are saying, and I have no knowledge at all in this matter
[just reading what you say], it seems that a policy approach that allows for
both distributions would be appropriate. I don't want to give you more work,
but you just made that point yourself.
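
One way such a policy approach could look is sketched below. This is purely hypothetical and not a design from the library or from the linked issue: the policy structs, the estimate() signature, and bin_variance are invented for illustration. The multinomial policy assumes the bin probability is estimated as count / total.

```cpp
#include <cassert>

// Hypothetical policy: Poisson estimate, variance equals the bin count.
struct PoissonVariance {
    static double estimate(double count, double /*total*/) { return count; }
};

// Hypothetical policy: multinomial estimate n p (1 - p),
// with p estimated from the data as count / total.
struct MultinomialVariance {
    static double estimate(double count, double total) {
        assert(total > 0.0 && count >= 0.0 && count <= total);
        const double p = count / total;
        return total * p * (1.0 - p);
    }
};

// The variance model is selected at compile time; Poisson is the default.
template <class VariancePolicy = PoissonVariance>
double bin_variance(double count, double total) {
    return VariancePolicy::estimate(count, total);
}
```

Note that the multinomial policy needs the total number of fills in addition to the bin count, so the choice of policy would affect what a histogram has to keep track of.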

degski
--
*"If something cannot go on forever, it will stop" - Herbert Stein*


Re: [histogram] Variance

Boost - Dev mailing list

> On 1. Oct 2018, at 15:04, degski <[hidden email]> wrote:
>
> From what you are saying, and I have no knowledge at all in this matter [just reading what you say], it seems that a policy approach, to allow for both distributions, seems appropriate. Don't want to give you more work, but you just made the [that] point yourself.

No problem, the point of the review is to discover any weaknesses of the design that we will regret later.

Please let's continue the discussion here:
https://github.com/HDembinski/histogram/issues/114

