Adding my sorting algorithm, ska_sort to boost

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Adding my sorting algorithm, ska_sort to boost

Boost - Dev mailing list
Hi,
a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here:
https://www.youtube.com/watch?v=zqs87a_7zxw

I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort)

I've also made it much easier to extend the algorithm with your own types. It was always easy if your type isn't too complicated, but now even weird types like std::variant are pretty easy to add. In my old version I had to go deep into the guts to add support for std::variant, now it's easy enough to do that that I'll add it to the documentation.

I'm thinking of releasing the new version in boost, mainly so that it will be used more widely and so that it can be kept working in the long term.

Since this is a long weekend in the US, I will try to as much of the work as possible this weekend. I'm currently trying to figure out how the boost documentation works (I saw that there are quickbook files in Boost.Sort, but how do I run the build for it?) and I've already made good progress on cleaning up my code and making it use the boost::sort namespace.

Before I go any further I wanted to ask though:

Does boost want my sorting algorithm? This would essentially deprecate spreadsort since this one is faster and more general and easier to use. To convince you that this is worth adding, I ran the benchmark that is included in the Boost.Sort source code.
On random numbers here is how long all the algorithms took:
std::sort: 9.04
pdqsort: 4.65
std::stable_sort: 11.57
spinsort: 5.42
flat_stable_sort: 7.26
spreadsort: 4.41
ska_sort: 2.00

There are a lot more bencmarks in the Boost.Sort code, and in general the results are this: If the data is already sorted or mostly sorted, other algorithms are faster, but if there are no easy patterns, ska_sort is always fastest. (I was going to include the entire table in this email, but the formatting breaks. I can attach all the results separately if anyone is interested)

The Boost.Sort code also has benchmarks for sorting "objects" and "strings" and the results are similar there: On random data ska_sort is always fastest.

I think this new version would be great in boost. It's not every day that a new sorting algorithm comes along that's literally twice as fast as what we had before. If we want this in boost, what would the next steps be? From the "submission process" document it sounds like there will be several stages of "Refinement" for the code. (or should this go through the Boost Library Incubator?) After I figure out how the documentation works I can upload a first version of this to github. (I want to get the documentation working so that we can talk about the interface)

Let me know what you think,

Malte

--
  Malte Skarupke
  [hidden email]

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Adding my sorting algorithm, ska_sort to boost

Boost - Dev mailing list
That would be a very useful addition indeed.

--
Janek Kozicki, PhD. DSc. Arch. Assoc. Prof.
Gdańsk University of Technology
Faculty of Applied Physics and Mathematics
Department of Theoretical Physics and Quantum Information
--
pg.edu.pl/jkozicki (click English flag on top right)
On 13 Oct 2019, 13:14 +0200, Malte Skarupke via Boost <[hidden email]>, wrote:

> Hi,
> a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here:
> https://www.youtube.com/watch?v=zqs87a_7zxw
>
> I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort)
>
> I've also made it much easier to extend the algorithm with your own types. It was always easy if your type isn't too complicated, but now even weird types like std::variant are pretty easy to add. In my old version I had to go deep into the guts to add support for std::variant, now it's easy enough to do that that I'll add it to the documentation.
>
> I'm thinking of releasing the new version in boost, mainly so that it will be used more widely and so that it can be kept working in the long term.
>
> Since this is a long weekend in the US, I will try to as much of the work as possible this weekend. I'm currently trying to figure out how the boost documentation works (I saw that there are quickbook files in Boost.Sort, but how do I run the build for it?) and I've already made good progress on cleaning up my code and making it use the boost::sort namespace.
>
> Before I go any further I wanted to ask though:
>
> Does boost want my sorting algorithm? This would essentially deprecate spreadsort since this one is faster and more general and easier to use. To convince you that this is worth adding, I ran the benchmark that is included in the Boost.Sort source code.
> On random numbers here is how long all the algorithms took:
> std::sort: 9.04
> pdqsort: 4.65
> std::stable_sort: 11.57
> spinsort: 5.42
> flat_stable_sort: 7.26
> spreadsort: 4.41
> ska_sort: 2.00
>
> There are a lot more bencmarks in the Boost.Sort code, and in general the results are this: If the data is already sorted or mostly sorted, other algorithms are faster, but if there are no easy patterns, ska_sort is always fastest. (I was going to include the entire table in this email, but the formatting breaks. I can attach all the results separately if anyone is interested)
>
> The Boost.Sort code also has benchmarks for sorting "objects" and "strings" and the results are similar there: On random data ska_sort is always fastest.
>
> I think this new version would be great in boost. It's not every day that a new sorting algorithm comes along that's literally twice as fast as what we had before. If we want this in boost, what would the next steps be? From the "submission process" document it sounds like there will be several stages of "Refinement" for the code. (or should this go through the Boost Library Incubator?) After I figure out how the documentation works I can upload a first version of this to github. (I want to get the documentation working so that we can talk about the interface)
>
> Let me know what you think,
>
> Malte
>
> --
> Malte Skarupke
> [hidden email]
>
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Adding my sorting algorithm, ska_sort to boost

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Hi Malte,

I see there hasn't been much response to your post yet.

Malte Skarupke wrote:
> a couple of years ago I wrote a generalized version of radix sort,
> called ska_sort. I gave a talk about it at CppNow, available here:
> https://www.youtube.com/watch?v=zqs87a_7zxw
>
> I have a new version of that algorithm that is both faster and more general.

> Before I go any further I wanted to ask though:
>
> Does boost want my sorting algorithm? This would essentially deprecate
> spreadsort since this one is faster and more general and easier to use.

> I think this new version would be great in boost. It's not every day
> that a new sorting algorithm comes along that's literally twice as
> fast as what we had before. If we want this in boost, what would the
> next steps be?

The obvious "next step" would be to see if it can be added to the existing
Boost.Sort library.  Doing that bypasses the public review process.  But
the author of that library may have other ideas.

Personally, often what I need to sort is small or has some complex
comparison function, making radix sort inappropriate.  If I did have
a large dataset, where the performance benefit would be useful, my
first thought would be to find a parallel sort (std:: parallel execution
policy is starting to appear in a few places now.)  Despite that, I
think that what you've got is worthwhile.

Others may be interested in your old blog posts:
https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/


Cheers, Phil.





_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Adding my sorting algorithm, ska_sort to boost

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list

> On Oct 13, 2019, at 03:06, Malte Skarupke via Boost <[hidden email]> wrote:
>
> Hi,
> a couple of years ago I wrote a generalized version of radix sort, called ska_sort. I gave a talk about it at CppNow, available here:
> https://www.youtube.com/watch?v=zqs87a_7zxw
>
> I have a new version of that algorithm that is both faster and more general. I can now sort almost anything. For example when I gave the talk I couldn't sort a std::vector<std::list<int>> because std::list doesn't have a random access interface, but now that works. (and it's faster than sorting the same thing with std::sort)
>

This seems like a product of a large investment of effort.
The big advantage over any previously available radix_sort
is the ability to sort almost any datatype if a key maps to an integer somehow.
Plus, an advantage in measured practical speed, and apparently no claims about Big O improvement.

The average performance is O(n*b) where b is unknown and/or dependent on n ?

Is  the theoretical worst-case O(n^2) ? Why? Is it because of working in-place?
(yes, I understand you fall back to std::sort if you hit this case)

Is this software/algorithm fully documented? - I mean by another way than reading the source code.

Maybe too many questions at once, but it was not clear after watching your talk at cppcon and the blog posts.

Cheers,
Kostas

 <https://www.youtube.com/watch?v=Nz1KZXbghj8>

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Adding my sorting algorithm, ska_sort to boost

Boost - Dev mailing list
Hi Malte,
I am Francisco Tapia, one of the co-authors of the Sort library.
We will examine your algorithm. But we must examine the internal parts of
the algorithm, and evaluate the correctness of the algorithm, how is
implemented the worst case and a couple of details. After, we check the
speed with the benchmarks used in the library, with several sets of data.
Since the near sorted to the fully unsorted, and from the single numbers to
structs with heavy comparison functions.

Now, I am overburdened by work, and I can't revise until December. But we
will do and provide to you a reasoned response, and according to it, decide
the inclusion or not in the library.
As commented by Phil Endecott, the most interesting now is the parallel
algorithms. Today, we don't have any parallel algorithm based on radix sort
or any other hybrid algorithm.
Because with small sets of data the differences are insignificant. But when
you have big sets of data, use parallel algorithms, and then, the new
single thread algorithms are in the middle, in the no man's land.

If you have any doubt or question, contact me, and I will try to answer you.

El mar., 15 oct. 2019 a las 23:06, Kostas Savvidis via Boost (<
[hidden email]>) escribió:

>
> > On Oct 13, 2019, at 03:06, Malte Skarupke via Boost <
> [hidden email]> wrote:
> >
> > Hi,
> > a couple of years ago I wrote a generalized version of radix sort,
> called ska_sort. I gave a talk about it at CppNow, available here:
> > https://www.youtube.com/watch?v=zqs87a_7zxw
> >
> > I have a new version of that algorithm that is both faster and more
> general. I can now sort almost anything. For example when I gave the talk I
> couldn't sort a std::vector<std::list<int>> because std::list doesn't have
> a random access interface, but now that works. (and it's faster than
> sorting the same thing with std::sort)
> >
>
> This seems like a product of a large investment of effort.
> The big advantage over any previously available radix_sort
> is the ability to sort almost any datatype if a key maps to an integer
> somehow.
> Plus, an advantage in measured practical speed, and apparently no claims
> about Big O improvement.
>
> The average performance is O(n*b) where b is unknown and/or dependent on n
> ?
>
> Is  the theoretical worst-case O(n^2) ? Why? Is it because of working
> in-place?
> (yes, I understand you fall back to std::sort if you hit this case)
>
> Is this software/algorithm fully documented? - I mean by another way than
> reading the source code.
>
> Maybe too many questions at once, but it was not clear after watching your
> talk at cppcon and the blog posts.
>
> Cheers,
> Kostas
>
>  <https://www.youtube.com/watch?v=Nz1KZXbghj8>
>
> _______________________________________________
> Unsubscribe & other changes:
> http://lists.boost.org/mailman/listinfo.cgi/boost
>

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: Adding my sorting algorithm, ska_sort to boost

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
Thanks for the replies. I finished a first pass of the documentation, tests, and other work of including my algorithm in boost and uploaded it here:

https://github.com/skarupke/sort/tree/master

I also re-ran all the single threaded benchmarks and updated the documentation.

For the big O complexity: I am actually still not quite sure what it is. I used similar notation to what spreadsort uses. I think O(n*key_length) is correct for the worst case, but for long keys (like when sorting long strings) the algorithm may fall back to comparison based sorting, so I used the same thing as spreadsort: min(O(n log n), O(n key_length)).

For speed: ska_sort should be the fastest sorting algorithm even for small numbers of items. I included three graphs of benchmarks, for sorting ints, floats and strings:
https://github.com/skarupke/sort/blob/master/doc/images/sorting_ints.png
https://github.com/skarupke/sort/blob/master/doc/images/sorting_strings.png
https://github.com/skarupke/sort/blob/master/doc/images/sorting_floats.png

For ints and strings ska_sort is always fastest, for floats pdqsort is briefly faster, but starting at N >= 128 ska_sort is faster. Part of the work of this new version of ska_sort was to make radix sort really fast even for small N. Before I would use std::sort for any small array of less than 128 items. Now that cutoff is at 32 items, so I skip straight to insertion sort and never do std::sort.

The biggest effect of this change is that it removes the "waves" from the benchmark graph, like we still see for spreadsort in the above picture. I used to also have those when I gave a talk about this algorithm. Now the graph is much more smooth. (this also confuses my answer to the question "what is the big O complexity of this algorithm?" Because the O(n*b) answer that I gave in my talk was justified by the presence of the waves)

The biggest questions I have at this point are around the interface. I reworked that completely so that even relatively advanced use cases are now pretty straightforward. I don't think that anyone else will ever need to use those advanced features, so I kept the documentation of them small, choosing instead to focus on the common use cases. And I made sure to give lots of examples of converting existing comparison-based sorting examples to my interface.

Oh and for the big point of a parallel version of this algorithm: At this point I have no idea how to do that. I haven't even begun to look into parallel algorithms. The current version was a lot of work already. I think that'll have to be a project for another time. (it's definitely appealing though since currently the sorted algorithms are only 2.4 times as fast as ska_sort on a machine with 8 cores/16 hyper threads in benchmark_numbers.cpp, so even a little bit of parallelism would beat the existing algorithms...)

Francisco, since you'll be busy in the near term it sounds like I can't expect too much response for another month, but I would really appreciate it if you could look at it a little just to see if anything obvious stands out to you that I can work on while I wait. (otherwise no biggie, I understand how hard it can be to find time for maintaining projects, since that is exactly one of the reasons why I want to get this into boost...)

Thanks,

Malte

--
  Malte Skarupke
  [hidden email]

> Date: Sun, 20 Oct 2019 13:24:06 +0200
> From: Francisco Jos? Tapia <[hidden email]>
> To: [hidden email]
> Subject: Re: [boost] Adding my sorting algorithm, ska_sort to boost
> Message-ID:
> <[hidden email]>
> Content-Type: text/plain; charset="UTF-8"
>
> Hi Malte,
> I am Francisco Tapia, one of the co-authors of the Sort library.
> We will examine your algorithm. But we must examine the internal parts of
> the algorithm, and evaluate the correctness of the algorithm, how is
> implemented the worst case and a couple of details. After, we check the
> speed with the benchmarks used in the library, with several sets of data.
> Since the near sorted to the fully unsorted, and from the single numbers to
> structs with heavy comparison functions.
>
> Now, I am overburdened by work, and I can't revise until December. But we
> will do and provide to you a reasoned response, and according to it, decide
> the inclusion or not in the library.
> As commented by Phil Endecott, the most interesting now is the parallel
> algorithms. Today, we don't have any parallel algorithm based on radix sort
> or any other hybrid algorithm.
> Because with small sets of data the differences are insignificant. But when
> you have big sets of data, use parallel algorithms, and then, the new
> single thread algorithms are in the middle, in the no man's land.
>
> If you have any doubt or question, contact me, and I will try to answer you.
>
> El mar., 15 oct. 2019 a las 23:06, Kostas Savvidis via Boost (<
> [hidden email]>) escribi?:
>
> >
> > > On Oct 13, 2019, at 03:06, Malte Skarupke via Boost <
> > [hidden email]> wrote:
> > >
> > > Hi,
> > > a couple of years ago I wrote a generalized version of radix sort,
> > called ska_sort. I gave a talk about it at CppNow, available here:
> > > https://www.youtube.com/watch?v=zqs87a_7zxw
> > >
> > > I have a new version of that algorithm that is both faster and more
> > general. I can now sort almost anything. For example when I gave the talk I
> > couldn't sort a std::vector<std::list<int>> because std::list doesn't have
> > a random access interface, but now that works. (and it's faster than
> > sorting the same thing with std::sort)
> > >
> >
> > This seems like a product of a large investment of effort.
> > The big advantage over any previously available radix_sort
> > is the ability to sort almost any datatype if a key maps to an integer
> > somehow.
> > Plus, an advantage in measured practical speed, and apparently no claims
> > about Big O improvement.
> >
> > The average performance is O(n*b) where b is unknown and/or dependent on n
> > ?
> >
> > Is  the theoretical worst-case O(n^2) ? Why? Is it because of working
> > in-place?
> > (yes, I understand you fall back to std::sort if you hit this case)
> >
> > Is this software/algorithm fully documented? - I mean by another way than
> > reading the source code.
> >
> > Maybe too many questions at once, but it was not clear after watching your
> > talk at cppcon and the blog posts.
> >
> > Cheers,
> > Kostas
> >
> >  <https://www.youtube.com/watch?v=Nz1KZXbghj8>
> >
> > _______________________________________________
> > Unsubscribe & other changes:
> > http://lists.boost.org/mailman/listinfo.cgi/boost
> >
>
>
> ------------------------------
>
> Subject: Digest Footer
>

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost