Median in Boost.Accumulators

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Median in Boost.Accumulators

Boost - Dev mailing list
Hello Boost

Some time ago, I needed to compute the exact median for
a stream of elements and after using Boost.Accumulators,
realized that it only provides an approximation of the
median and not the exact number.

I rolled up my custom implementation of the algorithm and
used it for my project. I wanted to check if people on
this list might be interested in such an implementation
for Boost.Accumulators in which case I can try to adapt my
implementation and send a PR.

Performance characteristics: For a stream of n numbers, the
memory required is O(n). Median can be queried at any point
of time in O(1) and adding all the elements costs O(nlog(n))

Lakshay

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
On 09/23/18 04:21, Lakshay Garg via Boost wrote:

> I rolled up my custom implementation of the algorithm and
> used it for my project. I wanted to check if people on
> this list might be interested in such an implementation
> for Boost.Accumulators in which case I can try to adapt my
> implementation and send a PR.

Boost.Accumulators does not seem to be actively maintained.

> Performance characteristics: For a stream of n numbers, the
> memory required is O(n). Median can be queried at any point
> of time in O(1) and adding all the elements costs O(nlog(n))

Sounds like you keep two ordered lists (e.g. std::set) of equal size.
Then the median is the first element in the second list.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
On Sun, 23 Sep 2018 at 07:29, Bjorn Reese via Boost
<[hidden email]> wrote:

>
> On 09/23/18 04:21, Lakshay Garg via Boost wrote:
>
> > I rolled up my custom implementation of the algorithm and
> > used it for my project. I wanted to check if people on
> > this list might be interested in such an implementation
> > for Boost.Accumulators in which case I can try to adapt my
> > implementation and send a PR.
>
> Boost.Accumulators does not seem to be actively maintained.

Is there a standard protocol for dealing with unmaintained
projects? It is sad to see such a useful library without a maintainer.
I would happily volunteer for maintaining this library but by
no means do I have the required skills/expertise for this task.

> > Performance characteristics: For a stream of n numbers, the
> > memory required is O(n). Median can be queried at any point
> > of time in O(1) and adding all the elements costs O(nlog(n))
>
> Sounds like you keep two ordered lists (e.g. std::set) of equal size.
> Then the median is the first element in the second list.

I'm using two std::priority_queues where one is a max heap and
the other is a min heap.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Sat, Sep 22, 2018 at 9:22 PM Lakshay Garg via Boost <
[hidden email]> wrote:

> Hello Boost
>
> Some time ago, I needed to compute the exact median for
> a stream of elements and after using Boost.Accumulators,
> realized that it only provides an approximation of the
> median and not the exact number.
>
> I rolled up my custom implementation of the algorithm and
> used it for my project. I wanted to check if people on
> this list might be interested in such an implementation
> for Boost.Accumulators in which case I can try to adapt my
> implementation and send a PR.
>
> Performance characteristics: For a stream of n numbers, the
> memory required is O(n). Median can be queried at any point
> of time in O(1) and adding all the elements costs O(nlog(n))
>

Could you compare and contrast this with the standard library's way of
computing a median, std::nth_element()?  Yout just give it N/2 for 'n', and
it computes the median.

Zach

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
On Sun, 23 Sep 2018 at 08:39, Zach Laine via Boost
<[hidden email]> wrote:

>
> On Sat, Sep 22, 2018 at 9:22 PM Lakshay Garg via Boost <
> [hidden email]> wrote:
>
> > I rolled up my custom implementation of the algorithm and
> > used it for my project. I wanted to check if people on
> > this list might be interested in such an implementation
> > for Boost.Accumulators in which case I can try to adapt my
> > implementation and send a PR.
> >
> > Performance characteristics: For a stream of n numbers, the
> > memory required is O(n). Median can be queried at any point
> > of time in O(1) and adding all the elements costs O(nlog(n))
> >
>
> Could you compare and contrast this with the standard library's way of
> computing a median, std::nth_element()?  Yout just give it N/2 for 'n', and
> it computes the median.

The reason for not using std::nth_element was because I wanted
to be able to query the median at any point of the sequence very
efficiently. Had I used std::nth_element, I would have needed O(n)
time for obtaining the median. For a case where I need to compute
median after every single element, this would have required O(n^2)
time for a sequence of length n. Here is my implementation in case
you would like to have a look:

https://gist.github.com/lakshayg/d27f29b4634cbfea3fc48c21b7b608ef

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
On Sun, 23 Sep 2018 at 11:59, Zach Laine <[hidden email]> wrote:

>
> On Sun, Sep 23, 2018 at 10:59 AM Lakshay Garg <[hidden email]> wrote:
>>
>> On Sun, 23 Sep 2018 at 08:39, Zach Laine via Boost
>> <[hidden email]> wrote:
>> >
>> > Could you compare and contrast this with the standard library's way of
>> > computing a median, std::nth_element()?  Yout just give it N/2 for 'n', and
>> > it computes the median.
>>
>> The reason for not using std::nth_element was because I wanted
>> to be able to query the median at any point of the sequence very
>> efficiently. Had I used std::nth_element, I would have needed O(n)
>> time for obtaining the median. For a case where I need to compute
>> median after every single element, this would have required O(n^2)
>> time for a sequence of length n. Here is my implementation in case
>> you would like to have a look:
>>
>> https://gist.github.com/lakshayg/d27f29b4634cbfea3fc48c21b7b608ef
>
> I missed that the first time, thanks.
>
> Your implementation does not seem to support a windowed working set
> (one element at a time is added and the oldest one removed). Is that right?

Yes, that is correct. This approach does not support a window.

> Also, what if I don't actually call get() very often?

If get() is not called very often, then I guess just using a plain vector and
calling nth_element on it would be the best approach.

> Is there any way to take advantage of make_heap()'s O(n) characteristics,
> and avoid doing an O(n) push_heap() n times?  I can't think of one offhand,
> just curious.

The complexity of push_heap is actually O(log(n)) and not O(n)

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
On Sun, 23 Sep 2018 at 12:53, Zach Laine <[hidden email]> wrote:

>
> On Sun, Sep 23, 2018 at 2:45 PM Lakshay Garg <[hidden email]> wrote:
>>
>> On Sun, 23 Sep 2018 at 11:59, Zach Laine <[hidden email]> wrote:
>> > Also, what if I don't actually call get() very often?
>>
>> If get() is not called very often, then I guess just using a plain vector and
>> calling nth_element on it would be the best approach.
>>
>> > Is there any way to take advantage of make_heap()'s O(n) characteristics,
>> > and avoid doing an O(n) push_heap() n times?  I can't think of one offhand,
>> > just curious.
>>
>> The complexity of push_heap is actually O(log(n)) and not O(n)
>
> [make.heap] says:
>
>  Complexity: At most 3 * (last - first) comparisons.
>

Correct! complexity for make_heap is O(n) and complexity for push_heap
is O(logn).

https://en.cppreference.com/w/cpp/algorithm/make_heap
https://en.cppreference.com/w/cpp/algorithm/push_heap

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Sat, 22 Sep 2018 at 19:21, Lakshay Garg <[hidden email]> wrote:

>
> I rolled up my custom implementation of the algorithm and
> used it for my project. I wanted to check if people on
> this list might be interested in such an implementation
> for Boost.Accumulators in which case I can try to adapt my
> implementation and send a PR.
>
> Performance characteristics: For a stream of n numbers, the
> memory required is O(n). Median can be queried at any point
> of time in O(1) and adding all the elements costs O(nlog(n))
>

I looked up the problem of computing median of a stream on SO
and found a few more approaches which are efficient in a rolling
window scenario.

https://stackoverflow.com/questions/10657503/find-running-median-from-a-stream-of-integers

I can try and implement one of those if we decide to have this
in the accumulators library.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 09/23/18 22:03, Lakshay Garg via Boost wrote:

> I can try and implement one of those if we decide to have this
> in the accumulators library.

It would be useful if this class can handle other quantiles as well.

The quantiles could be specified as template parameters using std::ratio
in a similar way as the following P^2 estimator does:

 
https://github.com/breese/trial.online/blob/develop/include/trial/online/quantile/psquare.hpp

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Median in Boost.Accumulators

Boost - Dev mailing list
On Sun, 30 Sep 2018 at 08:37, Bjorn Reese via Boost
<[hidden email]> wrote:

>
> On 09/23/18 22:03, Lakshay Garg via Boost wrote:
>
> > I can try and implement one of those if we decide to have this
> > in the accumulators library.
>
> It would be useful if this class can handle other quantiles as well.
>
> The quantiles could be specified as template parameters using std::ratio
> in a similar way as the following P^2 estimator does:
>

Interesting suggestion. Will give it a shot this week, shouldn't be too hard.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost