[proposed][histogram]

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

[proposed][histogram]

Boost - Dev mailing list
Dear Boost community,

after another year of work, I hope to raise your interest once again in a much improved version of "histogram", a library proposed for inclusion in Boost.

https://github.com/HDembinski/histogram
http://blincubator.com/bi_library/histogram-2/?gform_post_id=1582

The library can be build either with b2 or cmake. It compiles successfully with several versions of clang and gcc and has 99 % test coverage (the remaining 1 % are basically un-hittable branches). Boost-style documentation is available.

The library implements a histogram class (a highly configurable policy-based template) for C++ and Python in C++11 code. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data. GBytes of input can be put in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.

A histogram implements a quantisation of a space of input values. The space is divided into non-overlapping cells. Instead of remembering the exact values of a tuple in that space, one just increases a counter in the corresponding cell. This is the lossy compression, you only remember the count in the cell and not the original values.

There are subtleties related to the counting if you want it to be safe and efficient. The library handles this for you. The standard policy implements intelligent counters which are fast, conserve memory, and are guaranteed to not overflow or get capped (capping happens when you use floating point numbers to count, like some implementations do). This is one of the two main feature of the library.

The other main feature is that the library allows you to seamlessly create one or multi-dimensional histograms using various schemes to divide the input space. A so-called axis class handles the division into cells along each dimension. There are several ways how you might want to define the cells that divide the input space, and library offers you interesting options. For example, if one of your inputs is an angle, the library provides a special axis class for periodic input values.

What's new:
- Static polymorphism: The original version of the library solely used dynamic polymorphism to support different axis classes. This variant still exists, but the current version also allows you to use static polymorphism, implemented internally with boost::mpl and boost::fusion. This provides a speed-boost, which makes this library faster than the open-source competitors I tested (CERN's ROOT framework, the GSL, and numpy.histogram), in addition to the enhanced flexibility.
- Arbitrary-precision counters: Counter overflow is now completely avoided by switching automatically to boost::multiprecision::cpp_int when the capacity of standard integer types is exhausted.
- Extensibility: The library is now much more extensible and configurable. All relevant behaviour can be exchanged by using another (possibly user-defined) policy. The library provides default trade-offs that should work for almost everybody, but you can chose your own trade-offs.

The new features were inspired by comments I got from the community. Now I am again at a point where I see the library in a very solid state.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [proposed][histogram]

Boost - Dev mailing list
On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:

> The library implements a histogram class (a highly configurable policy-based template) for C++ and Python in C++11 code. Histograms are a standard tool to explore Big Data. They allow one to visualise and analyse distributions of random variables. A histogram provides a lossy compression of input data. GBytes of input can be put in a compact form which requires only a small fraction of the original memory. This makes histograms convenient for interactive data analysis and further processing.

Given that the compression is lossy, I am wondering how it compares with
a distribution estimator like:

   https://arxiv.org/abs/1507.05073v2

A common use-case when collecting numerical data is to determine the
quantiles. Boost.Accumulators contains an estimator (extended_p_square)
for that.

The advantage of such estimators are that they execute in constant time
and with constant memory usage, where the constant depends only on the
required precision.

PS: I am aware that this is a non-trivial question, so I do not expect
     an answer.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [proposed][histogram]

Boost - Dev mailing list
On 2017-04-12 12:34, Bjorn Reese via Boost wrote:

> On 04/12/2017 11:37 AM, Hans Dembinski via Boost wrote:
>
>> The library implements a histogram class (a highly configurable
>> policy-based template) for C++ and Python in C++11 code. Histograms
>> are a standard tool to explore Big Data. They allow one to visualise
>> and analyse distributions of random variables. A histogram provides a
>> lossy compression of input data. GBytes of input can be put in a
>> compact form which requires only a small fraction of the original
>> memory. This makes histograms convenient for interactive data analysis
>> and further processing.
>
> Given that the compression is lossy, I am wondering how it compares
> with
> a distribution estimator like:
>
>   https://arxiv.org/abs/1507.05073v2
>
> A common use-case when collecting numerical data is to determine the
> quantiles. Boost.Accumulators contains an estimator (extended_p_square)
> for that.
>
> The advantage of such estimators are that they execute in constant time
> and with constant memory usage, where the constant depends only on the
> required precision.
>
> PS: I am aware that this is a non-trivial question, so I do not expect
>     an answer.

Hi,

Simple answer: Histograms are not designed for estimating the quantile
function, but the pdf.

While it is true that a sufficiently good estimate of the pdf will give
you an estimate of the quantiles via the inverse of the cdf, the
obtainable precision depends on the size of the bins chosen for the
histogram.

On the other hand, if your data is multi-variate or your pdf
multi-modal, you will have a hard time using quantiles, while you could
still do for example outlier detection using histograms.

Best,
Oswin

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [proposed][histogram]

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 04/12/2017 01:26 PM, Oswin Krause via Boost wrote:
> On 2017-04-12 12:34, Bjorn Reese via Boost wrote:

>> Given that the compression is lossy, I am wondering how it compares with
>> a distribution estimator like:
>>
>>   https://arxiv.org/abs/1507.05073v2

> Simple answer: Histograms are not designed for estimating the quantile
> function, but the pdf.

The first reference I gave is a distribution (pdf and cdf) estimator.

> While it is true that a sufficiently good estimate of the pdf will give
> you an estimate of the quantiles via the inverse of the cdf, the
> obtainable precision depends on the size of the bins chosen for the
> histogram.
>
> On the other hand, if your data is multi-variate or your pdf
> multi-modal, you will have a hard time using quantiles, while you could
> still do for example outlier detection using histograms.

Good answer for the quantile estimators.


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [proposed][histogram]

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Hi Bjorn,

> Given that the compression is lossy, I am wondering how it compares with
> a distribution estimator like:
>
>  https://arxiv.org/abs/1507.05073v2

I have to read the reference carefully, which is quite interesting, but I think the scope of such a density estimator is different.

Histograms are conceptually simple, and simplicity is sometimes a plus. If you really want to have an estimator of the data pdf, then other algorithms may be better. Histograms can be transformed into an estimator of the pdf, but that's not their primary use case in my experience.

In my field, particle physics, we are usually not interested in the data pdf itself. We come up with a theoretical model pdf on our own, which depends on some parameter(s) of interest (e.g. the mass of a new particle). We then adjust this parameter until the theoretical model fits the data. This can done by maximising the likelihood of the model in view of the data. If the data set is big, then it is more practical to use a histogram instead of the original data. We then maximise the likelihood of obtaining such a histogram.

For this purpose, histograms are great, because they have clear properties and the analysis is straight-forward. The counts in the cells follow Poisson distributions, the stochastic fluctuations are independent in each cell. Neither is true for smooth density estimators, which makes them unsuitable for model fitting.

> A common use-case when collecting numerical data is to determine the
> quantiles. Boost.Accumulators contains an estimator (extended_p_square)
> for that.

I had a look into Boost.Accumulators, and my impression was that the algorithms are for one-dimensional data only. The histogram library allows you to handle multi-dimensional input. This goes in addition to what why I wrote above, about the necessity to statistically model the histogram counts.

In summary, the histogram library is not a particularly clever density estimator, but it tries to be the most efficient and convenient implementation of a classical histogram.

Best regards,
Hans


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: [proposed][histogram]

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Hi Oswin,

> On the other hand, if your data is multi-variate or your pdf multi-modal, you will have a hard time using quantiles, while you could still do for example outlier detection using histograms.

yeah, I think so, too.

The histogram library comes with extra bins along each dimension for outliers. Those can be turned off individually for each dimension, if needed.

Best regards,
Hans

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Loading...