Matrix multiplication performance

Matrix multiplication performance

palik imre
Hi all,

It seems that the matrix multiplication in ublas falls back to the trivial algorithm.  On my machine, even the following function outperforms it for square matrices bigger than 173x173 (by a huge margin for matrices bigger than 190x190), while not performing considerably worse for smaller matrices:

#include <cassert>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>

using namespace boost::numeric::ublas;

matrix<double>
matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
{
  assert(lhs.size2() == rhs.size1());
  matrix<double> rv(lhs.size1(), rhs.size2());
  // transpose once, so each column of rhs becomes a contiguous row of r
  matrix<double> r = trans(rhs);
  for (unsigned c = 0; c < rhs.size2(); c++)
    {
      matrix_column<matrix<double> > out(rv, c);
      matrix_row<matrix<double> > in(r, c);
      // column c of the result is lhs times column c of rhs
      out = prod(lhs, in);
    }
  return rv;
}


Is there anybody working on improving the matrix multiplication performance?

If not, then I can try to find some spare cycles ...

Cheers,

Imre Palik

Re: Matrix multiplication performance

Oswin Krause
Hi Palik,

This is a known problem. In your case you should already get better
performance by using axpy_prod instead of prod. There are currently
moves towards a ublas 2.0, which should make this a non-problem in the
future.
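
For what it's worth, here is a minimal sketch of that suggestion, using the matrix-matrix overload of axpy_prod from <boost/numeric/ublas/operation.hpp> (the wrapper name is just for illustration):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>

using namespace boost::numeric::ublas;

matrix<double>
matmul_axpy(const matrix<double> &lhs, const matrix<double> &rhs)
{
  matrix<double> rv(lhs.size1(), rhs.size2());
  // init == true zeroes rv before accumulating lhs * rhs into it
  axpy_prod(lhs, rhs, rv, true);
  return rv;
}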


On 2016-01-17 21:23, palik imre wrote:
> [...]

Re: Matrix multiplication performance

Michael Lehn-2
Hi Palik and Oswin,

I attached a simple implementation of a matrix-matrix product (gemm.hpp) which has fewer
than 250 lines of code.  I also attached a simple test program (matprod.cc).  The test
program has a simple wrapper for the implementation in gemm.hpp.  For a real application
you should use traits like for the BLAS bindings.  You can compile it with

g++ -O3 -Wall -std=c++11 -I $BOOST_PATH matprod.cc

where BOOST_PATH contains your boost implementation.  Let me know if this works
for you.
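
For readers without the attachments, the following is a minimal sketch of the cache-blocking structure such an implementation typically uses (a simplified version of the scheme from the BLIS paper: no packing, plain C++ loops, illustrative block sizes).  It is not the actual contents of gemm.hpp:

#include <algorithm>
#include <cstddef>

// C (m x n) += A (m x k) * B (k x n); all matrices row-major and contiguous
void
gemm_blocked(std::size_t m, std::size_t n, std::size_t k,
             const double *A, const double *B, double *C)
{
    const std::size_t MC = 256, KC = 128, NC = 4096;   // illustrative block sizes

    for (std::size_t jc = 0; jc < n; jc += NC) {           // block columns of B/C
        const std::size_t nc = std::min(NC, n - jc);
        for (std::size_t pc = 0; pc < k; pc += KC) {       // blocks of the k dimension
            const std::size_t kc = std::min(KC, k - pc);
            for (std::size_t ic = 0; ic < m; ic += MC) {   // block rows of A/C
                const std::size_t mc = std::min(MC, m - ic);
                // macro-kernel: a real implementation packs the current A and
                // B panels into contiguous buffers here and calls a
                // register-blocked micro-kernel (ugemm) on MR x NR tiles
                for (std::size_t i = 0; i < mc; ++i)
                    for (std::size_t p = 0; p < kc; ++p) {
                        const double a = A[(ic + i)*k + (pc + p)];
                        for (std::size_t j = 0; j < nc; ++j)
                            C[(ic + i)*n + (jc + j)] += a * B[(pc + p)*n + (jc + j)];
                    }
            }
        }
    }
}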


Cheers,

Michael

-----------------------------------------------------------------------------------
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 50-23534, Fax: (+49) 731 50-23548
-----------------------------------------------------------------------------------

On 18 Jan 2016, at 09:25, Oswin Krause <[hidden email]> wrote:
> [...]


Re: Matrix multiplication performance

palik imre
In reply to this post by Oswin Krause
Is there a public git repo for ublas 2.0?


On Monday, 18 January 2016, 9:25, Oswin Krause <[hidden email]> wrote:
> [...]

Re: Matrix multiplication performance

Nasos Iliopoulos
Nope

On 01/19/2016 05:12 AM, palik imre wrote:
> Is there a public git repo for ublas 2.0?


Re: Matrix multiplication performance

palik imre
In reply to this post by palik imre
Hi Michael,

I cannot see any attachments ...


On Tuesday, 19 January 2016, 11:12, palik imre <[hidden email]> wrote:
> [...]

Re: Matrix multiplication performance

Michael Lehn-2
Sorry for the inconvenience.  I guess sending attachments to the mailing list is prohibited.  I put the
code on a website:

http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/



On 19 Jan 2016, at 15:14, palik imre <[hidden email]> wrote:
> [...]

Re: Matrix multiplication performance

Joaquim Duran Comas
Hello,

I think some development on a new product algorithm was done in the branch https://github.com/uBLAS/ublas/tree/ublas_feature0004_fast_matrix_multiplication.

Thanks and Best Regards,
Joaquim Duran


2016-01-19 15:41 GMT+01:00 Michael Lehn <[hidden email]>:
> [...]

Re: Matrix multiplication performance

Michael Lehn-2
So has anyone compared the performance yet?


On 20 Jan 2016, at 00:47, Joaquim Duran Comas <[hidden email]> wrote:
> [...]

Re: Matrix multiplication performance

Nasos Iliopoulos
Hello all,
The algorithm in this branch aims at cascading kernels and is by no means complete. Each kernel represents various sizes of the block matrix used in the inner loop of the multiplication. The multiplication is performed by picking the kernel that would provide the highest performance for the given matrix dimensions (I think this has not been pushed to this branch).

It is considerably faster (last time I checked) than typical block algorithms that use fixed block sizes, especially in non-square cases. In square cases it is more or less the same as typical explicitly vectorized block-based algorithms. This applies when comparing with most linalg libraries out there. It is also trivially parallelizable.

There are some issues in fully bringing this into ublas, but they can probably be dealt with. Here are a few:

1. Different CPU architectures require different block sizes and different kernel cascade architectures. Hence there is a need for a facility to define these at compile time (see the sketch after this list).

2. Statically cascading the kernels ends in large compilation times and large executables. I haven't performed any benchmarks using dynamic block sizes to avoid these issues.

3. I haven't found a systematic way of figuring out the cascade other than running benchmark tests on each CPU architecture and creating an ordered list. These benchmarks indicate that there are better block size choices than the ones that can be computed by the rationale in http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf

4. I never ran (3) to build a cascade using different numbers of threads on SMP machines.

5. There is a need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s).
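
As an illustration of (1), here is a minimal sketch of such a compile-time facility, assuming per-architecture traits selected by a tag; all names and values are illustrative, not code from the branch:

#include <cstddef>

struct sse_tag {};
struct avx_tag {};

// block sizes for the macro-kernel, specialized per architecture
template <typename Arch>
struct block_traits;

template <>
struct block_traits<sse_tag> {   // placeholder values, to be tuned per CPU
    static constexpr std::size_t MC = 256, KC = 128, NC = 4096;
    static constexpr std::size_t MR = 4, NR = 4;
};

template <>
struct block_traits<avx_tag> {
    static constexpr std::size_t MC = 384, KC = 192, NC = 4096;
    static constexpr std::size_t MR = 8, NR = 4;
};

// a product implementation would then be parameterized on the tag:
//   template <typename Arch> void gemm(...) { using bt = block_traits<Arch>; ... }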

I am afraid I don't have time to fully realize the implementation at the moment, especially regarding how to deal elegantly with (1) and (2).

As a first step, Michael Lehn's algorithm could be implemented (it is much, much faster than the current simplistic implementation).

Michael: Have you checked the performance against Blaze for a few non-square matrix multiplications?

- Nasos


On 01/20/2016 03:46 PM, Michael Lehn wrote:
> So has anyone compared the performance yet?

Re: Matrix multiplication performance

Michael Lehn-2
Hi Nasos,

first of all I don't want to take undue credit and want to point out that this is not my algorithm.  It is based on

    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

    https://github.com/flame/blis
For a few cores (4-8) it can easily be made multithreaded.  For many-cores like the Intel Xeon Phi this is a bit more
sophisticated, but still not too hard.  The demo I posted does not use micro-kernels that exploit SSE, AVX or
FMA instructions.  With those the matrix product is on par with Intel MKL, just like BLIS.  For my platforms I wrote
my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
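
To sketch the multithreading point: for a few cores it is enough to split the block rows of C across threads, e.g. with OpenMP.  Here gemm_blocked stands for any single-threaded blocked kernel (like the sketch posted with gemm.hpp); the names are illustrative:

#include <algorithm>
#include <cstddef>

// single-threaded blocked product: C (m x n) += A (m x k) * B (k x n), row-major
void gemm_blocked(std::size_t m, std::size_t n, std::size_t k,
                  const double *A, const double *B, double *C);

void gemm_blocked_mt(std::size_t m, std::size_t n, std::size_t k,
                     const double *A, const double *B, double *C)
{
    const std::ptrdiff_t MC = 256;  // stripe height, illustrative
    // each thread computes a horizontal stripe of C, so writes never conflict
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t ic = 0; ic < static_cast<std::ptrdiff_t>(m); ic += MC) {
        const std::size_t mc = std::min<std::size_t>(MC, m - ic);
        gemm_blocked(mc, n, k, A + ic*k, B, C + ic*n);
    }
}

// compile with -fopenmp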

Maybe you could help me to integrate your code in the benchmark example I posted above.

About Blaze: do they have their own implementation of a matrix-matrix product?  It seems to require a
tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.
IMHO they have only tuned the "easy" stuff like BLAS Level 1 and Level 2.  In that case it makes more
sense to compare the performance with the actual underlying GEMM implementation.  But if I am wrong,
let me know.

About the block size: in my experience you get better performance if you choose the blocking factors dynamically at runtime
depending on the problem size.  Depending on the architecture you can just specify ranges like 256-384 for the
blocking factor MC.  In my code it then also needs to satisfy the restriction that it can be divided by the factor MR.
I know that doing things at compile time is a C++ fetish.  But the runtime overhead is negligible, and having
blocks of similar sizes easily pays off.
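
A minimal sketch of such a runtime choice for MC, assuming a range like 256-384 and the divisibility-by-MR restriction (the function name is illustrative):

#include <cstddef>

// pick a blocking factor MC <= mc_max so that m splits into blocks of similar
// size, rounded up to a multiple of MR (the result may exceed mc_max by less
// than MR after rounding)
std::size_t
choose_mc(std::size_t m, std::size_t MR, std::size_t mc_max = 384)
{
    if (m <= mc_max)
        return ((m + MR - 1)/MR)*MR;                     // one block suffices
    const std::size_t blocks = (m + mc_max - 1)/mc_max;  // fewest blocks that fit
    const std::size_t mc     = (m + blocks - 1)/blocks;  // blocks of similar size
    return ((mc + MR - 1)/MR)*MR;                        // keep MC divisible by MR
}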

Cheers,

Michael


On 21 Jan 2016, at 22:41, nasos <[hidden email]> wrote:
> [...]

Re: Matrix multiplication performance

Michael Lehn-2
first of all I don't want to take undue credit and want to point out that this is not my algorithm.  It is based on

    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

and

    https://github.com/flame/blis

of course.


Re: Matrix multiplication performance

Nasos Iliopoulos
In reply to this post by Michael Lehn-2
Michael,
please see below

On 01/21/2016 05:23 PM, Michael Lehn wrote:
> For a few cores (4-8) it can easily be made multithreaded.  For many-cores
> like the Intel Xeon Phi this is a bit more sophisticated, but still not too hard.

Setting up Phis is indeed an issue, especially because they are "locked" with icpc. OpenMP is working properly though.

> The demo I posted does not use micro-kernels that exploit SSE, AVX or
> FMA instructions.  With those the matrix product is on par with Intel MKL.

If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive and intel even more so.

> Maybe you could help me to integrate your code in the benchmark example I posted above.

I will try to find some time to spend on the code.

> About Blaze: do they have their own implementation of a matrix-matrix product?  It seems to require a
> tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.

I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication).



Re: Matrix multiplication performance

Michael Lehn-2

On 22 Jan 2016, at 00:28, nasos <[hidden email]> wrote:
> If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive and intel even more so.

No, believe me.  No chance to beat asm :-)


Re: Matrix multiplication performance

Michael Lehn-2

On 22 Jan 2016, at 00:30, Michael Lehn <[hidden email]> wrote:
> No, believe me.  No chance to beat asm :-)

Have a look here

for architectures with AVX or FMA it's even more impressive.  The asm micro-kernels are more than just exploiting registers.  If
you really want to achieve peak performance you actually have to do the math at the asm level.  I can extend the page with SSE, AVX
and FMA micro-kernels tomorrow.  The performance boost is significant: something like a factor of 3 for SSE and a factor of 5 for FMA.
I attached a benchmark from my lecture (http://www.mathematik.uni-ulm.de/numerik/hpc/ws15/uebungen/index.html).  It compares
the GEMM performance on an Intel i5:

- "Blocked Session 8" is basically what I posted here
- "Blocked Session 8 + AVX" replaces "ugemm" with an asm implementation

But anyway, I think we need a common base for doing benchmarks, so you and others can convince yourselves on your own hardware.
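
For illustration only, here is a minimal sketch of such a micro-kernel written with FMA intrinsics rather than raw asm; the tile size MR = NR = 4 and the packing layout are assumptions, not the kernels from my page (compile with -mfma):

#include <immintrin.h>
#include <cstddef>

// C tile (4 x 4, row-major, leading dimension ldc) += A_panel * B_panel,
// where the A panel is packed so that A[4*p + i] = A_block(i, p) and the
// B panel so that B[4*p + j] = B_block(p, j)
void
ugemm_4x4_fma(std::size_t kc, const double *A, const double *B,
              double *C, std::size_t ldc)
{
    // keep the whole C tile in registers across the kc loop
    __m256d c0 = _mm256_loadu_pd(C + 0*ldc);
    __m256d c1 = _mm256_loadu_pd(C + 1*ldc);
    __m256d c2 = _mm256_loadu_pd(C + 2*ldc);
    __m256d c3 = _mm256_loadu_pd(C + 3*ldc);

    for (std::size_t p = 0; p < kc; ++p) {
        const __m256d b = _mm256_loadu_pd(B + 4*p);             // row p of B panel
        c0 = _mm256_fmadd_pd(_mm256_set1_pd(A[4*p + 0]), b, c0);
        c1 = _mm256_fmadd_pd(_mm256_set1_pd(A[4*p + 1]), b, c1);
        c2 = _mm256_fmadd_pd(_mm256_set1_pd(A[4*p + 2]), b, c2);
        c3 = _mm256_fmadd_pd(_mm256_set1_pd(A[4*p + 3]), b, c3);
    }

    _mm256_storeu_pd(C + 0*ldc, c0);
    _mm256_storeu_pd(C + 1*ldc, c1);
    _mm256_storeu_pd(C + 2*ldc, c2);
    _mm256_storeu_pd(C + 3*ldc, c3);
}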


Re: Matrix multiplication performance

Michael Lehn-2
In reply to this post by Nasos Iliopoulos

>> About Blaze: do they have their own implementation of a matrix-matrix product?  It seems to require a
>> tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.
> I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication)

They use a log scale for the benchmarks.  IMHO that does not make any sense.  On this benchmark they are only
better for matrix dimensions smaller than 100.  Even if you run the same implementation twice you get
fluctuations of that magnitude.  At 1000 it's identical.  And outside of the C++ world I never saw log scales for
MFLOPS benchmarks.  It makes sense if you compare the runtime of an O(N^k) algorithm, but I don't see the
point for illustrating performance.  All this started with the BTL (Benchmark Template Library).

But I will look into the BLAZE code to make sure (as in prove it).


Re: Matrix multiplication performance

palik imre
In reply to this post by palik imre
Hi Michael,

As for the block size, you could choose it somewhat better.

Both MR and NR should be at least the size of an L1 cache line (i.e., 16 for double and 32 for float on Intel); moreover, you should try to align your blocks to L1 cache lines (e.g., using posix_memalign(), or the cache-aligned allocator from TBB).
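
A minimal sketch of that allocation (POSIX), assuming a 64-byte L1 cache line; the helper name is illustrative:

#include <cstdlib>
#include <cstddef>

// returns a buffer of n doubles aligned to a 64-byte boundary; free with std::free
double *
alloc_cacheline_aligned(std::size_t n)
{
  void *p = nullptr;
  if (posix_memalign(&p, 64, n*sizeof(double)) != 0)
    return nullptr;
  return static_cast<double *>(p);
}

// e.g. for the packing buffers: double *A_pack = alloc_cacheline_aligned(MC*KC);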

This alone brought something like a 50% increase in performance in my tests.  This also means that the original implementation stops being competitive at around 50x50 squares, as opposed to 75x75.

I have some ideas about the other parameters too, and I plan to have a look with perf during the weekend.

Dynamic vs. static block size: a wild guess would be that dynamic probably pays off above a certain size.  I should measure, though.

Cheers,

Imre


On Friday, 22 January 2016, 0:30, Michael Lehn <[hidden email]> wrote:
> [...]

Re: Matrix multiplication performance

Michael Lehn-2
In reply to this post by Nasos Iliopoulos

On 22 Jan 2016, at 00:28, nasos <[hidden email]> wrote:
> I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication)

I started today with similar experiments on BLAZE and had a closer look at their internal implementation.  By default
they call an external BLAS backend.  On my machine I used the Intel MKL.  But you are right, they also have
an internal implementation that can be used if no external BLAS is available.  I will publish the results on this page:

http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html

At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product look
poor.  I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I have used.  So don't take the
current results as-is.

Cheers,

Michael




Re: Matrix multiplication performance

Michael Lehn-2

On 16 Feb 2016, at 17:09, Michael Lehn <[hidden email]> wrote:
> I will publish the results on this page:
>
> http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html
>
> At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product look
> poor.  I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I have used.  So don't take the
> current results as-is.

I modified the benchmark code in favour of BLAZE as Klaus Iglberger suggested.  The results are now slightly
better.  However, compared to Intel MKL they are way off.  I got similar results on (my) Haswell.  The problem seems
to be as stated by Klaus Iglberger: "The optimization flags are fine. The performance difference is due to the lack of a
specifically optimized kernel."  So on other platforms it might perform much better.




-----------------------------------------------------------------------------------
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 50-23534, Fax: (+49) 731 50-23548
-----------------------------------------------------------------------------------
