Hi all,
It seems that the matrix multiplication in ublas falls back to the trivial schoolbook algorithm. On my machine, even the following function outperforms it for square matrices bigger than 173x173 (by a huge margin for matrices bigger than 190x190), while not performing considerably worse for smaller matrices:

matrix<double>
matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
{
    assert(lhs.size2() == rhs.size1());
    matrix<double> rv(lhs.size1(), rhs.size2());
    matrix<double> r = trans(rhs);  // transpose once so each rhs column is a contiguous row
    for (unsigned c = 0; c < rhs.size2(); c++)
    {
        matrix_column<matrix<double> > out(rv, c);
        matrix_row<matrix<double> > in(r, c);
        out = prod(lhs, in);        // column c of the result = lhs * column c of rhs
    }
    return rv;
}

Is there anybody working on improving the matrix multiplication performance?

If not, then I can try to find some spare cycles ...

Cheers,

Imre Palik
Hi Palik,
this is a known problem. In your case you should already get better performance when using axpy_prod instead of prod. There are currently moves towards a ublas 2.0 which should make this a non-problem in the future.
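In case it helps, a minimal sketch of what the axpy_prod call looks like (the function multiply and its argument names are just placeholders; the final bool asks axpy_prod to zero-initialize the target before accumulating):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>   // axpy_prod lives here

using namespace boost::numeric::ublas;

matrix<double> multiply(const matrix<double> &A, const matrix<double> &B)
{
    matrix<double> C(A.size1(), B.size2());
    axpy_prod(A, B, C, true);   // true: treat C as uninitialized, zero it first
    return C;
}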
Hi Palik and Oswin,
I attached a simple implementation of a matrix-matrix product (gemm.hpp) which has less than 250 lines of code. I also attached a simple test program (matprod.cc). The test program has a simple wrapper for the implementation in gemm.hpp. For a real application you should use traits like those for the BLAS bindings.

You can compile it with

    g++ -O3 -Wall -std=c++11 -I $BOOST_PATH matprod.cc

where BOOST_PATH contains your boost implementation. Let me know if this works for you.

Cheers,

Michael

-----------------------------------------------------------------------------------
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 50-23534, Fax: (+49) 731 50-23548
-----------------------------------------------------------------------------------
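For readers who cannot get at the attachment, the rough shape of such a cache-blocked product is sketched below. To be clear, this is only an illustration and not the contents of gemm.hpp: the block sizes MC/KC/NC are placeholder values, and a BLIS-style implementation would additionally pack each block into a contiguous buffer and delegate the innermost loops to a micro kernel.

#include <algorithm>
#include <cstddef>

// Sketch only: cache-blocked C += A*B for row-major arrays.
// A is m x k, B is k x n, C is m x n and must be initialized by the caller.
// MC/KC/NC are illustrative, untuned block sizes.
template <typename T>
void gemm_blocked(std::size_t m, std::size_t n, std::size_t k,
                  const T *A, const T *B, T *C)
{
    const std::size_t MC = 256, KC = 128, NC = 4096;

    for (std::size_t jc = 0; jc < n; jc += NC)           // panels of B columns
        for (std::size_t pc = 0; pc < k; pc += KC)       // panels of the k dimension
            for (std::size_t ic = 0; ic < m; ic += MC) { // blocks of A rows
                const std::size_t nb = std::min(NC, n - jc);
                const std::size_t kb = std::min(KC, k - pc);
                const std::size_t mb = std::min(MC, m - ic);
                // macro kernel: C[ic.., jc..] += A[ic.., pc..] * B[pc.., jc..]
                for (std::size_t i = 0; i < mb; ++i)
                    for (std::size_t p = 0; p < kb; ++p) {
                        const T a = A[(ic + i) * k + (pc + p)];
                        for (std::size_t j = 0; j < nb; ++j)
                            C[(ic + i) * n + (jc + j)] += a * B[(pc + p) * n + (jc + j)];
                    }
            }
}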
Is there a public git repo for ublas 2.0?
Nope
Hi Michael,

I cannot see any attachments ...
Sorry for the inconvenience. I guess sending attachments to the mailing list is prohibited. I put the code on a website:

http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/
Hello,

I think that some development on a new product algorithm was done in the branch https://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.

Thanks and Best Regards,

Joaquim Duran
So has anyone compared the performance yet?
Hello all,
The algorithm in this branch is aiming at cascading kernels and is by no means complete. Each kernel represents various sizes of the block matrix used in the inner loop of the multiplication. The multiplication is performed by picking the kernel that would provide the highest performance for certain matrix dimensions (I think this has not been pushed in this branch). It is considerably faster (last time I checked) than typical block algorithms that use fixed block sizes, especially for non-square cases. In square cases it is more or less on par with typical explicitly vectorized, block-based algorithms. This applies when comparing to most linalg libraries out there. It is also trivially parallelizable.

There are some issues in fully bringing this into ublas, but they can probably be dealt with. Here are a few:

1. Different CPU architectures require different block sizes and different kernel cascade architectures. Hence there is the need for a facility to define these at compile time (a sketch of one way this could look follows at the end of this message).

2. Statically cascading the kernels ends in long compilation times and large executables. I haven't performed any benchmarks using dynamic block sizes to avoid these issues.

3. I haven't been able to find a systematic way of figuring out the cascade, other than running benchmark tests on each CPU architecture and creating an ordered list. These benchmarks indicate that there are better block size choices than the ones that can be computed by the rationale in http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf.

4. I never ran (3) to build a cascade using different numbers of threads on SMP machines.

5. There is the need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s).

I am afraid I don't have time to fully realize the implementation at the moment, especially on how to deal elegantly with (1) and (2). As a first step Michael Lehn's algorithm could be implemented (and it is much, much faster than the current simplistic implementation).

Michael: Have you checked the performance against Blaze for a few non-square matrix multiplications?

- Nasos
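As a concrete illustration of (1): a micro kernel parametrized on its block sizes at compile time could look roughly like the sketch below. This is purely illustrative (it assumes BLIS-style packed A and B panels, and the instantiated (MR, NR) pairs are arbitrary); a "cascade" would instantiate the template for several shapes and dispatch at runtime from a benchmark-derived table.

#include <cstddef>

// Sketch only: micro kernel with compile-time block sizes MR x NR.
// A holds kc slices of MR values, B holds kc slices of NR values
// (packed panels).  Computes C += A * B on an MR x NR block of C.
template <std::size_t MR, std::size_t NR>
void ugemm(std::size_t kc, const double *A, const double *B,
           double *C, std::size_t ldC)
{
    double acc[MR * NR] = {};                // accumulator, meant to live in registers
    for (std::size_t p = 0; p < kc; ++p)     // kc rank-1 updates
        for (std::size_t i = 0; i < MR; ++i)
            for (std::size_t j = 0; j < NR; ++j)
                acc[i * NR + j] += A[p * MR + i] * B[p * NR + j];
    for (std::size_t i = 0; i < MR; ++i)     // write back into the C block
        for (std::size_t j = 0; j < NR; ++j)
            C[i * ldC + j] += acc[i * NR + j];
}

// Hypothetical cascade: a few instantiations dispatched at runtime.
// The caller must pack its panels to match the chosen (MR, NR).
using ugemm_fn = void (*)(std::size_t, const double *, const double *,
                          double *, std::size_t);
ugemm_fn cascade[] = { ugemm<4, 4>, ugemm<8, 4>, ugemm<4, 8> };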
Hi Nasos,
first of all I don't want to take wrong credit and want to point out that this is not my algorithm. It is based on

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
https://github.com/flame/blis

For a few cores (4-8) it can easily be made multithreaded. For many-cores like the Intel Xeon Phi this is a bit more sophisticated, but still not too hard.

The demo I posted does not use micro kernels that exploit SSE, AVX or FMA instructions. With those, the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote my own micro-kernels, but the interface of the function ugemm is compatible with BLIS. Maybe you could help me to integrate your code into the benchmark example I posted above.

About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product. IMHO they have only tuned the "easy" stuff like BLAS Level 1 and Level 2. In that case it makes more sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong, let me know.

About the block sizes: In my experience you get better performance if you choose them dynamically at runtime depending on the problem size. Depending on the architecture you can just specify ranges like 256-384 for the blocking factor MC. In my code it then also needs to satisfy the restriction that it can be divided by the factor MR. I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible, and having blocks of similar sizes easily pays off.

Cheers,

Michael
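To illustrate the last point (a sketch only, not the code from gemm.hpp; the range bounds are placeholder values): picking MC at runtime so that the blocks have similar sizes and MC stays divisible by MR could look roughly like this.

#include <cstddef>

// Illustrative only: choose a blocking factor MC in [MC_MIN, MC_MAX]
// such that MC is a multiple of MR and the m dimension splits into
// blocks of similar size.  The defaults are placeholders, not tuned.
std::size_t choose_MC(std::size_t m, std::size_t MR,
                      std::size_t MC_MIN = 256, std::size_t MC_MAX = 384)
{
    if (m <= MC_MAX)                                  // one block: round m up to MR
        return ((m + MR - 1) / MR) * MR;
    std::size_t blocks = (m + MC_MAX - 1) / MC_MAX;   // fewest blocks that fit
    std::size_t mc = (m + blocks - 1) / blocks;       // balanced block size
    mc = ((mc + MR - 1) / MR) * MR;                   // round up to a multiple of MR
    return mc < MC_MIN ? MC_MIN : mc;
}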
Michael,
please see below.

Setting up Phis is indeed an issue, especially because they are "locked" with icpc. OpenMP is working properly though.

If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive and intel even more so.

I will try to find some time to spend on the code.

I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL:

https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication
On 22 Jan 2016, at 00:28, nasos <[hidden email]> wrote:
> If you compile with -O3 I think you are getting near-optimal SSE vectorization.

No, believe me. No chance to beat asm :-)
On 22 Jan 2016, at 00:30, Michael Lehn <[hidden email]> wrote:
> No, believe me. No chance to beat asm :-)

Have a look here; for architectures with AVX or FMA it is even more impressive. The asm micro kernels are more than just exploiting registers. If you really want to achieve peak performance you have to actually do the math on asm level. I can extend the page with SSE, AVX and FMA micro kernels tomorrow. The performance boost is significant: something like a factor 3 for SSE and a factor 5 for FMA.

I attached a benchmark from my lecture (http://www.mathematik.uni-ulm.de/numerik/hpc/ws15/uebungen/index.html). It compares the GEMM performance on an Intel i5:

- "Blocked Session 8" is basically what I posted here
- "Blocked Session 8 + AVX" replaces "ugemm" with an asm implementation

But anyway, I think we need a common base for doing benchmarks, so you and others can convince yourselves on your own hardware.

(Attachment: bench2.gemm.mflops.svg)
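To give a flavour of what such a kernel does (a sketch only, written with intrinsics rather than raw asm, and not the kernel from the lecture page): a 4x4 double micro kernel with FMA, assuming BLIS-style packed panels, looks roughly like this. Compile with something like g++ -O2 -mfma.

#include <immintrin.h>
#include <cstddef>

// Illustrative 4x4 double micro kernel using FMA intrinsics.
// A is packed as kc slices of MR=4 values, B as kc slices of NR=4 values.
// Computes C += A * B on a 4x4 block of C, one rank-1 update per p.
void ugemm_4x4_fma(std::size_t kc, const double *A, const double *B,
                   double *C, std::size_t ldC)
{
    __m256d c0 = _mm256_setzero_pd();   // accumulator row 0
    __m256d c1 = _mm256_setzero_pd();
    __m256d c2 = _mm256_setzero_pd();
    __m256d c3 = _mm256_setzero_pd();
    for (std::size_t p = 0; p < kc; ++p) {
        __m256d b = _mm256_loadu_pd(B + 4 * p);                  // B(p, 0..3)
        c0 = _mm256_fmadd_pd(_mm256_set1_pd(A[4 * p + 0]), b, c0);
        c1 = _mm256_fmadd_pd(_mm256_set1_pd(A[4 * p + 1]), b, c1);
        c2 = _mm256_fmadd_pd(_mm256_set1_pd(A[4 * p + 2]), b, c2);
        c3 = _mm256_fmadd_pd(_mm256_set1_pd(A[4 * p + 3]), b, c3);
    }
    // write back: C(i, 0..3) += accumulator row i
    _mm256_storeu_pd(C + 0 * ldC, _mm256_add_pd(_mm256_loadu_pd(C + 0 * ldC), c0));
    _mm256_storeu_pd(C + 1 * ldC, _mm256_add_pd(_mm256_loadu_pd(C + 1 * ldC), c1));
    _mm256_storeu_pd(C + 2 * ldC, _mm256_add_pd(_mm256_loadu_pd(C + 2 * ldC), c2));
    _mm256_storeu_pd(C + 3 * ldC, _mm256_add_pd(_mm256_loadu_pd(C + 3 * ldC), c3));
}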
They use a log scale for the benchmarks. IMHO that does not make any sense. On this benchmark they are only better for matrix dimensions smaller than 100. Even if you run the same implementation twice you get fluctuations of that magnitude. At 1000 it is identical. And outside of the C++ world I never saw log scales for MFLOPS benchmarks. It makes sense if you compare the runtime of an O(N^k) algorithm, but I don't see the point for illustrating performance. All this started with the BTL (Benchmark Template Library). But I will look into the BLAZE code to make sure (as in proof).
Hi Michael,

as for the block size, you could choose it somewhat better. Both MR and NR should be at least the size of an L1 cacheline (i.e., 16 for double and 32 for float on Intel). Moreover you should try to align your blocks to L1 cachelines (e.g. using posix_memalign(), or the cache-aligned allocator from TBB). This alone brought something like a 50% increase in performance in my tests. This also means that the original implementation stops being competitive at around 50x50 squares, as opposed to 75x75.

I have some ideas about the other parameters too + I plan to have a look with perf during the weekend.

Dynamic vs. static block size: a wild guess would be that dynamic is probably worth doing above a certain size. Should measure though.

Cheers,

Imre
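For reference, a minimal sketch of cacheline-aligned allocation with posix_memalign (illustrative; 64 bytes is the usual L1 line size on current Intel CPUs, and the function name is just a placeholder):

#include <cstdlib>
#include <cstddef>

// Allocate a buffer for a packed block, aligned to a 64-byte L1 cacheline.
// The caller releases it later with free().
double *alloc_aligned_block(std::size_t n_elems)
{
    void *p = nullptr;
    if (posix_memalign(&p, 64, n_elems * sizeof(double)) != 0)
        return nullptr;                 // allocation failed
    return static_cast<double *>(p);
}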
On 22 Jan 2016, at 00:28, nasos <[hidden email]> wrote:
> I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM.
I started today with similar experiments on BLAZE and had a closer look at their internal implementation. By default they are calling an external BLAS backend. On my machine I used the Intel MKL. But you are right, they also have an internal implementation that can be used if no external BLAS is available. I will publish the results on this page:

http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html

At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product look poor. I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I have used. So don't take the current results as-is.

Cheers,

Michael
On 16 Feb 2016, at 17:09, Michael Lehn <[hidden email]> wrote:
> At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product look poor. I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I have used.

I modified the benchmark code in favour of BLAZE as Klaus Iglberger suggested. The results are now slightly better. However, compared to Intel MKL they are way off. I got similar results on (my) Haswell. The problem seems to be as stated by Klaus Iglberger:

"The optimization flags are fine. The performance difference is due to the lack of a specifically optimized kernel."

So on other platforms it might perform much better.

-----------------------------------------------------------------------------------
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 50-23534, Fax: (+49) 731 50-23548
-----------------------------------------------------------------------------------