# Matrix multiplication performance

19 messages

## Matrix multiplication performance

Hi all,

It seems that the matrix multiplication in ublas ends up with the trivial algorithm.  On my machine, even the following function outperforms it for square matrices bigger than 173*173 (by a huge margin for matrices bigger than 190*190), while not performing considerably worse for smaller matrices:

```cpp
// Note: the list archive stripped the template arguments; matrix<double>
// is assumed for the element type.  Uses boost::numeric::ublas matrix,
// matrix_proxy (matrix_row/matrix_column), prod and trans.
matrix<double>
matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
{
    assert(lhs.size2() == rhs.size1());
    matrix<double> rv(lhs.size1(), rhs.size2());
    matrix<double> r = trans(rhs);
    for (unsigned c = 0; c < rhs.size2(); c++)
    {
        matrix_column<matrix<double> > out(rv, c);
        matrix_row<matrix<double> > in(r, c);
        out = prod(lhs, in);
    }
    return rv;
}
```

Is there anybody working on improving the matrix multiplication performance?

If not, then I can try to find some spare cycles ...

Cheers,

Imre Palik

## Re: Matrix multiplication performance

Hi Palik,

this is a known problem.  In your case you should already get better performance by using axpy_prod instead of prod.  There are currently moves towards a ublas 2.0, which should make this a non-problem in the future.

Oswin Krause
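For reference, a minimal sketch of the suggested drop-in (the wrapper function and dense `matrix<double>` operands are illustrative assumptions; `axpy_prod` itself lives in `boost/numeric/ublas/operation.hpp`):

```cpp
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>   // axpy_prod

using namespace boost::numeric::ublas;

matrix<double> gemm(const matrix<double> &A, const matrix<double> &B)
{
    matrix<double> C(A.size1(), B.size2());
    axpy_prod(A, B, C, true);   // true: treat C as uninitialized, i.e. C = A*B
    return C;
}
```

Unlike `prod`, `axpy_prod` writes straight into a preallocated result, which avoids the temporary that the expression-template path would otherwise create.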

## Re: Matrix multiplication performance

Hi Palik and Oswin,

I attached a simple implementation of a matrix-matrix product (gemm.hpp) which has fewer than 250 lines of code.  I also attached a simple test program (matprod.cc).  The test program has a simple wrapper for the implementation in gemm.hpp.  For a real application you should use traits like for the BLAS bindings.  You can compile it with

    g++ -O3 -Wall -std=c++11 -I $BOOST_PATH matprod.cc

where BOOST_PATH contains your Boost installation.  Let me know if this works for you.

Cheers,

Michael

-----------------------------------------------------------------------------------
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 50-23534, Fax: (+49) 731 50-23548
-----------------------------------------------------------------------------------

Attachments: gemm.hpp (7K), matprod.cc (3K)
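As context for what such a compact blocked GEMM looks like, here is a hedged sketch of the loop structure; it is not the attached gemm.hpp (the block sizes are illustrative, and the real code additionally packs panels and uses an MR x NR register micro-kernel):

```cpp
#include <algorithm>
#include <cstddef>

// Row-major C = A*B (m x k times k x n) with a cache-blocked loop nest.
void gemm_blocked(std::size_t m, std::size_t n, std::size_t k,
                  const double *A, const double *B, double *C)
{
    const std::size_t MC = 256, KC = 256, NC = 128;    // illustrative block sizes
    std::fill(C, C + m*n, 0.0);
    for (std::size_t jc = 0; jc < n; jc += NC)         // block column of B and C
        for (std::size_t pc = 0; pc < k; pc += KC)     // block of the k-dimension
            for (std::size_t ic = 0; ic < m; ic += MC) // block row of A and C
                for (std::size_t i = ic; i < std::min(ic + MC, m); ++i)
                    for (std::size_t p = pc; p < std::min(pc + KC, k); ++p) {
                        const double a = A[i*k + p];   // reused across the j loop
                        for (std::size_t j = jc; j < std::min(jc + NC, n); ++j)
                            C[i*n + j] += a * B[p*n + j];
                    }
}
```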

## Re: Matrix multiplication performance

(In reply to Oswin Krause)

Is there a public git repo for ublas 2.0?

Imre Palik

## Re: Matrix multiplication performance

Nope.

On 01/19/2016 05:12 AM, palik imre wrote:
> Is there a public git repo for ublas 2.0?

## Re: Matrix multiplication performance

Hi Michael,

I cannot see any attachments ...

Imre

## Re: Matrix multiplication performance

Sorry for the inconvenience.  I guess sending attachments to the mailing list is prohibited.  I put the code on a website:

http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/

Michael

## Re: Matrix multiplication performance

Hello,

I think some development of a new product algorithm was done in the branch https://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.

Thanks and best regards,

Joaquim Duran

## Re: Matrix multiplication performance

So has anyone compared the performance yet?

Michael

## Re: Matrix multiplication performance

Hello all,

The algorithm in this branch is aiming at cascading kernels and is by no means complete.  Each kernel represents various sizes of the block matrix used in the inner loop of the multiplication.  The multiplication is performed by picking the kernel that would provide the highest performance for certain matrix dimensions (I think this has not been pushed in this branch).

It is considerably faster (last time I checked) than typical block algorithms that use fixed block sizes, especially in non-square cases.  In square cases it is more or less the same as typical explicitly vectorized block-based algorithms.  This applies when comparing to most linalg libraries out there.  It is also trivially parallelizable.

There are some issues in fully bringing this into ublas, but they can probably be dealt with.  Here are a few:

1. Different CPU architectures require different block sizes and different kernel cascade architectures.  Hence there is the need for a facility to define these at compile time (see the sketch after this message).
2. Statically cascading the kernels results in long compile times and large executables.  I haven't performed any benchmarks using dynamic block sizes to avoid these issues.
3. I haven't been able to find a systematic way of figuring out the cascade, other than running benchmark tests on each CPU architecture and creating an ordered list.  These benchmarks indicate that there are better block-size choices than the ones that can be computed by the rationale in http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf.
4. I never ran (3) to build a cascade using different numbers of threads on SMP machines.
5. There is the need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s).

I am afraid I don't have time to fully realize the implementation at the moment, especially on how to deal elegantly with (1) and (2).  As a first step Michael Lehn's algorithm could be implemented (and it is much much faster than the current simplistic implementation).

Michael: Have you checked the performance against Blaze for a few non-square matrix multiplications?

- Nasos
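To make point (1) concrete, here is a minimal sketch of a compile-time facility for per-architecture block sizes; the trait name, the macro test, and all numeric values are illustrative assumptions, not code from the branch:

```cpp
#include <cstddef>

// Hypothetical blocking trait: a specialization (or preprocessor branch)
// pins the GEMM block sizes for one element type on one architecture.
template <typename T>
struct gemm_blocking;            // primary template intentionally undefined

template <>
struct gemm_blocking<double> {
#if defined(__AVX__)
    // Illustrative values only; real ones come from per-CPU benchmarking.
    static constexpr std::size_t MC = 256, KC = 384, NC = 4096;
    static constexpr std::size_t MR = 4,   NR = 8;
#else
    static constexpr std::size_t MC = 128, KC = 256, NC = 2048;
    static constexpr std::size_t MR = 4,   NR = 4;
#endif
};
```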

## Re: Matrix multiplication performance

Hi Nasos,

first of all I don't want to take wrong credit and want to point out that this is not my algorithm.  It is based on

    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
    https://github.com/flame/blis

For a few cores (4-8) it can easily be made multithreaded.  For many-cores like Intel Xeon Phi this is a bit more sophisticated but still not too hard.  The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA instructions.  With those, the matrix product is on par with Intel MKL, just like BLIS.  For my platforms I wrote my own micro-kernels, but the interface of function ugemm is compatible with BLIS.

Maybe you could help me to integrate your code into the benchmark example I posted above.

About Blaze: Do they have their own implementation of a matrix-matrix product?  It seems to require a tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.  IMHO they have only tuned the "easy" stuff like BLAS Level 1 and Level 2.  In that case it makes more sense to compare the performance with the actual underlying GEMM implementation.  But if I am wrong, let me know.

About the block sizes: In my experience you get better performance if you choose them dynamically at runtime depending on the problem size.  Depending on the architecture you can just specify ranges like 256-384 for blocking factor MC.  In my code it then also needs to satisfy the restriction that it can be divided by factor MR.  I know that doing things at compile time is a C++ fetish.  But the runtime overhead is negligible, and having blocks of similar sizes easily pays off.

Cheers,

Michael
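A small sketch of the runtime blocking choice described above (the function name and the exact rounding policy are assumptions; the posted code may differ):

```cpp
#include <cstddef>

// Split the m-dimension into nearly equal blocks with MC <= MC_max
// (e.g. 384), rounded up so that MC is divisible by the micro-kernel
// height MR.
std::size_t choose_MC(std::size_t m, std::size_t MR, std::size_t MC_max = 384)
{
    if (m == 0)
        return MR;
    std::size_t blocks = (m + MC_max - 1) / MC_max;   // fewest blocks that fit
    std::size_t mc     = (m + blocks - 1) / blocks;   // nearly equal block sizes
    return ((mc + MR - 1) / MR) * MR;                 // round up to multiple of MR
}
```

For example, m = 1000 with MR = 4 yields three blocks of size 336 instead of 384 + 384 + 232, which is the "blocks of similar sizes" effect mentioned above.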

## Re: Matrix multiplication performance

> first of all I don't want to take wrong credit and want to point out that this is not my algorithm.  It is based on

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

and of course

https://github.com/flame/blis

## Re: Matrix multiplication performance

(In reply to Michael Lehn)

Michael, please see below.

> For a few cores (4-8) it can easily be made multithreaded.  For many-cores like Intel Xeon Phi this is a bit more sophisticated but still not too hard.

Setting up Phis is indeed an issue, especially because they are "locked" with icpc.  OpenMP is working properly though.

> The demo I posted does not use micro-kernels that exploit SSE, AVX or FMA instructions.  With those, the matrix product is on par with Intel MKL, just like BLIS.

If you compile with -O3 I think you are getting near-optimal SSE vectorization.  gcc is truly impressive, and intel is even more so.

> Maybe you could help me to integrate your code into the benchmark example I posted above.

I will try to find some time to spend on the code.

> About Blaze: Do they have their own implementation of a matrix-matrix product?

I will check the benchmarks I ran.  I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM.  Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication).

- Nasos

## Re: Matrix multiplication performance

On 22 Jan 2016, at 00:28, nasos wrote:
> If you compile with -O3 I think you are getting near-optimal SSE vectorization.  gcc is truly impressive, and intel is even more so.

No, believe me.  No chance to beat asm :-)

Michael

## Re: Matrix multiplication performance

Have a look here; for architectures with AVX or FMA it's even more impressive.  The asm micro-kernels are more than just exploiting registers.  If you really want to achieve peak performance you actually have to do the math on asm level.  I can extend the page with SSE, AVX and FMA micro-kernels tomorrow.  The performance boost is significant: something like factor 3 for SSE and factor 5 for FMA.

I attached a benchmark from my lecture (http://www.mathematik.uni-ulm.de/numerik/hpc/ws15/uebungen/index.html).  It compares the GEMM performance on an Intel i5:

- "Blocked Session 8" is basically what I posted here
- "Blocked Session 8 + AVX" replaces "ugemm" with an asm implementation

But anyway, I think we need a common base for doing benchmarks, so you and others can convince yourselves on your own hardware.

Michael

Attachment: bench2.gemm.mflops.svg (33K)
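To illustrate what such a micro-kernel computes, here is a hedged sketch of its register-resident rank-1 updates written with FMA intrinsics rather than raw asm; this is not the kernel referred to above, and the 4x4 shape, the packed-panel layout, and the function name are assumptions (compile with -mfma):

```cpp
#include <immintrin.h>
#include <cstddef>

// 4x4 double micro-kernel: C(0..3, 0..3) += A_panel * B_panel, where A and
// B point to panels packed so that step k holds a 4-entry column of A and a
// 4-entry row of B.  C is row-major with leading dimension ldC.
void ugemm_4x4(std::size_t kc, const double *A, const double *B,
               double *C, std::size_t ldC)
{
    __m256d c0 = _mm256_setzero_pd(), c1 = _mm256_setzero_pd();
    __m256d c2 = _mm256_setzero_pd(), c3 = _mm256_setzero_pd();
    for (std::size_t k = 0; k < kc; ++k, A += 4, B += 4) {
        const __m256d b = _mm256_loadu_pd(B);               // row k of packed B
        c0 = _mm256_fmadd_pd(_mm256_set1_pd(A[0]), b, c0);  // one fused
        c1 = _mm256_fmadd_pd(_mm256_set1_pd(A[1]), b, c1);  //   multiply-add
        c2 = _mm256_fmadd_pd(_mm256_set1_pd(A[2]), b, c2);  //   per row of C
        c3 = _mm256_fmadd_pd(_mm256_set1_pd(A[3]), b, c3);
    }
    // Accumulate the register tile into C.
    _mm256_storeu_pd(C + 0*ldC, _mm256_add_pd(_mm256_loadu_pd(C + 0*ldC), c0));
    _mm256_storeu_pd(C + 1*ldC, _mm256_add_pd(_mm256_loadu_pd(C + 1*ldC), c1));
    _mm256_storeu_pd(C + 2*ldC, _mm256_add_pd(_mm256_loadu_pd(C + 2*ldC), c2));
    _mm256_storeu_pd(C + 3*ldC, _mm256_add_pd(_mm256_loadu_pd(C + 3*ldC), c3));
}
```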

## Re: Matrix multiplication performance

(In reply to Nasos Iliopoulos)

> Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication)

They use a log scale for the benchmarks.  IMHO that does not make any sense.  On this benchmark they are only better for matrix dimensions smaller than 100.  Even if you run the same implementation twice you get fluctuations of that magnitude.  At 1000 it is identical.  And outside of the C++ world I never saw log scales for MFLOPS benchmarks.  It makes sense if you compare the runtime of an O(N^k) algorithm, but I don't see the point for illustrating performance.  All this started with the BTL (Benchmark Template Library).

But I will look into the BLAZE code to make sure (as in proof).

Michael

## Re: Matrix multiplication performance

Hi Michael,

as for the block sizes, you could choose them somewhat better.  Both MR and NR should be at least the size of an L1 cache line (i.e., 16 for double and 32 for float on Intel); moreover you should try to align your blocks to the L1 cache line (e.g., using posix_memalign(), or the cache-aligned allocator from TBB).  This alone brought something like a 50% increase in performance in my tests.  It also means that the original implementation stops being competitive at around 50*50 squares, as opposed to 75*75.

I have some ideas about the other parameters too, and I plan to have a look with perf during the weekend.

Dynamic vs. static block size: a wild guess would be that dynamic probably becomes worthwhile above a certain size.  Should measure though.

Cheers,

Imre
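A minimal sketch of the alignment suggestion (POSIX only; the 64-byte line size and the helper name are assumptions):

```cpp
#include <cstdlib>   // posix_memalign, free
#include <new>       // std::bad_alloc

// Allocate n doubles on a cache-line boundary, e.g. for the packing buffers.
double *alloc_cacheline_aligned(std::size_t n)
{
    void *p = nullptr;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        throw std::bad_alloc();
    return static_cast<double *>(p);   // release with free()
}
```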

## Re: Matrix multiplication performance

(In reply to Nasos Iliopoulos)

> I will check the benchmarks I ran.  I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM.

I started today with similar experiments on BLAZE and had a closer look at their internal implementation.  By default they are calling an external BLAS backend; on my machine I used the Intel MKL.  But you are right, they also have an internal implementation that can be used if no external BLAS is available.  I will publish the results on this page.

At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product seem to look poor.  I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I have used.  So don't take the current results as-is.

Cheers,

Michael