

Hi all,
It seems that the matrix multiplication in ublas ends up with the trivial algorithm. On my machine, even the following function outperforms it for square matrices bigger than 173*173 (by a huge margin for matrices bigger than 190*190), while not performing considerably worse for smaller matrices:
#include <cassert>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>

using namespace boost::numeric::ublas;

matrix<double>
matmul_byrow(const matrix<double> &lhs, const matrix<double> &rhs)
{
    assert(lhs.size2() == rhs.size1());
    matrix<double> rv(lhs.size1(), rhs.size2());
    matrix<double> r = trans(rhs);
    for (unsigned c = 0; c < rhs.size2(); c++)
    {
        matrix_column<matrix<double> > out(rv, c);
        matrix_row<matrix<double> > in(r, c);
        out = prod(lhs, in);
    }
    return rv;
}
Is there anybody working on improving the matrix multiplication performance?
If not, then I can try to find some spare cycles ...
Cheers,
Imre Palik


Hi Palik,
this is a known problem. In your case you should already get better
performance when using axpy_prod instead of prod. There are currently
moves towards a ublas 2.0 which should make this a non-problem in the future.
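For example, a minimal sketch of an axpy_prod-based variant of the original function (assuming the usual uBLAS headers; the name matmul_axpy is just for illustration, not measured here):

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/operation.hpp>   // axpy_prod

using namespace boost::numeric::ublas;

matrix<double>
matmul_axpy(const matrix<double> &lhs, const matrix<double> &rhs)
{
    matrix<double> rv(lhs.size1(), rhs.size2());
    // init == true zeroes rv before accumulating lhs * rhs into it
    axpy_prod(lhs, rhs, rv, true);
    return rv;
}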
On 2016-01-17 21:23, palik imre wrote:
> Is there anybody working on improving the matrix multiplication
> performance?
>
> If not, then I can try to find some spare cycles ...


Hi Palik and Oswin,
I attached a simple implementation of a matrix-matrix product (gemm.hpp) which has less than 250 lines of code. I also attached a simple test program (matprod.cc). The test program has a simple wrapper for the implementation in gemm.hpp. For a real application you should use traits like for the BLAS bindings. You can compile it with
g++ -O3 -Wall -std=c++11 -I $BOOST_PATH matprod.cc
where BOOST_PATH contains your boost implementation. Let me know if this works for you.
Cheers,
Michael
--
Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 5023534, Fax: (+49) 731 5023548


Is there a public git repo for ublas 2.0?
On Monday, 18 January 2016, 9:25, Oswin Krause <[hidden email]> wrote:
> There are currently moves towards a ublas 2.0 which should make this a
> non-problem in the future.


Hi Michael,
I cannot see any attachments ...
On Tuesday, 19 January 2016, 11:12, palik imre <[hidden email]> wrote:
> Is there a public git repo for ublas 2.0?


Sorry for the inconvenience. I guess sending attachments to the mailing list is prohibited. I put the code on a website:
http://apfel.mathematik.uni-ulm.de/~lehn/test_ublas/
> Hi Michael,
> I cannot see any attachments ...


Hello all,
The algorithm in this branch is aiming at cascading kernels and is by no means complete. Each kernel represents various sizes of the block matrix used in the inner loop of the multiplication. The multiplication is performed by picking the kernel that would provide the highest performance for certain matrix dimensions (I think this has not been pushed in this branch).
It is considerably faster (last time I checked) than typical block algorithms that use fixed block sizes, especially for non-square cases. In square cases it is more or less the same as typical explicit-vectorization block-based algorithms. This applies when comparing to most linalg libraries out there. It is also trivially parallelizable.
There are some issues in fully bringing this into ublas, but they can probably be dealt with. Here are a few:
1. Different CPU architectures require different block sizes and different kernel cascade architectures. Hence there is the need for a facility to define these at compile time (a rough sketch of such a facility follows after this list).
2. Statically cascading the kernels ends up in large compilation times and large executables. I haven't performed any benchmarks using dynamic block sizes to avoid these issues.
3. I haven't been able to find a systematic way of figuring out the cascade other than running benchmark tests on each CPU architecture and creating an ordered list. These benchmarks indicate that there are better block size choices than the ones that can be computed by the rationale in http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf.
4. I never ran (3) to build a cascade using different numbers of threads on SMP machines.
5. There is the need for proper introspection so that the product operates only on ublas::matrix(es) and not generally on matrix_expression(s).
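A hypothetical sketch of the compile-time facility from point 1 (the names, defaults and GEMM_* macros are made up for illustration, they are not from the branch):

#include <cstddef>

// Block sizes as compile-time parameters with per-architecture defaults
// that can be overridden on the command line, e.g. -DGEMM_MC=384.
#ifndef GEMM_MC
#define GEMM_MC 256
#endif
#ifndef GEMM_KC
#define GEMM_KC 256
#endif
#ifndef GEMM_NC
#define GEMM_NC 4096
#endif

template <typename T,
          std::size_t MC = GEMM_MC,
          std::size_t KC = GEMM_KC,
          std::size_t NC = GEMM_NC>
struct gemm_blocking
{
    static constexpr std::size_t mc = MC;
    static constexpr std::size_t kc = KC;
    static constexpr std::size_t nc = NC;
};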
I am afraid I don't have time to fully realize the implementation at
the moment, especially on how to deal elegantly with (1) and (2).
As a first start Michael Lehn's algorithm could be implemented (and
it is much much faster than the current simplistic implementation).
Michael: Have you checked the performance against Blaze for a few non-square matrix multiplications?
 Nasos
On 01/20/2016 03:46 PM, Michael Lehn wrote:
> So has anyone compared the performance yet?


Hi Nasos,
first of all I don't want to take wrong credits and want to point out that this is not my algorithm. It is based on:
http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
https://github.com/flame/blis
For a few cores (4-8) it can easily be made multithreaded. For many-cores like Intel Xeon Phi this is a bit more sophisticated but still not too hard. The demo I posted does not use micro kernels that exploit SSE, AVX or FMA instructions. With such kernels the matrix product is on par with Intel MKL, just like BLIS. For my platforms I wrote my own microkernels but the interface of function ugemm is compatible with BLIS.
Maybe you could help me to integrate your code in the benchmark example I posted above.
About Blaze: Do they have their own implementation of a matrix-matrix product? It seems to require a tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product. IMHO they only have tuned the "easy" stuff like BLAS Level 1 and Level 2. In that case it makes more sense to compare the performance with the actual underlying GEMM implementation. But if I am wrong, let me know.
About the block sizes: In my experience you get better performance if you choose them dynamically at runtime depending on the problem size. Depending on the architecture you can just specify ranges like 256-384 for the blocking factor MC. In my code it then also needs to satisfy the restriction that it is divisible by the factor MR. I know that doing things at compile time is a C++ fetish. But the runtime overhead is negligible and having blocks of similar sizes easily pays off.
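As an illustration only (this is not the code from gemm.hpp), a run-time choice of MC under those constraints might look like:

#include <cstddef>

// Illustrative only: split the m dimension into row panels of similar size,
// each at most MC_max rows and divisible by the register block size MR.
std::size_t
choose_MC(std::size_t m, std::size_t MC_max = 384, std::size_t MR = 4)
{
    if (m == 0)
        return MR;
    std::size_t panels = (m + MC_max - 1) / MC_max;   // panels of <= MC_max rows
    std::size_t mc     = (m + panels - 1) / panels;   // distribute rows evenly
    return ((mc + MR - 1) / MR) * MR;                 // round up to a multiple of MR
}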
Cheers,
Michael


Michael,
please see below
On 01/21/2016 05:23 PM, Michael Lehn wrote:
> For a few cores (4-8) it can easily be made multithreaded. For many-cores like
> Intel Xeon Phi this is a bit more sophisticated but still not too hard.
Setting up Phis is indeed an issue, especially because they are "locked" with icpc. OpenMP is working properly though.
> The demo I posted does not use micro kernels that exploit SSE, AVX or FMA
> instructions. With such kernels the matrix product is on par with Intel MKL,
> just like BLIS. For my platforms I wrote my own microkernels but the interface
> of function ugemm is compatible with BLIS.
If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive and intel even more so.
> Maybe you could help me to integrate your code in the benchmark example I posted above.
I will try to find some time to spend on the code.
> About Blaze: Do they have their own implementation of a matrix-matrix product? It
> seems to require a tuned BLAS implementation ("Otherwise you get only poor
> performance") for the matrix-matrix product.
I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrix-matrix-multiplication).


Nasos wrote:
> If you compile with -O3 I think you are getting near-optimal SSE vectorization.
> gcc is truly impressive and intel even more so.
No, believe me. No chance to beat asm :)


>> If you compile with -O3 I think you are getting near-optimal SSE vectorization.
>> gcc is truly impressive and intel even more so.
>
> No, believe me. No chance to beat asm :)
Have a look here; for architectures with AVX or FMA it's even more impressive. The asm micro kernels are more than just exploiting registers. If you really want to achieve peak performance you have to actually do math at the asm level. I can extend the page with SSE, AVX and FMA micro kernels tomorrow. The performance boost is significant: something like factor 3 for SSE and factor 5 for FMA. On an Intel i5 the GEMM performance:
 "Blocked Session 8" is basically what I posted here  “Blocked Session 8 + AVX” replaced “ugemm” with a asm implementation
But anyway. I think we need a common base for doing benchmarks, so you and others can convince yourself on your own hardware.


>> About Blaze: Do they have their own implementation of a matrix-matrix product?
>
> Their benchmarks indicate that they are faster than MKL
> (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrix-matrix-multiplication)
They use a log scale for the benchmarks. IMHO that does not make any sense. On this benchmark they are only better for matrix dimensions smaller than 100. Even if you run the same implementation twice you get fluctuations of that magnitude. At 1000 it's identical. And outside of the C++ world I never saw log scales for MFLOPS benchmarks. It makes sense if you compare the runtime of an O(N^k) algorithm, but I don't see the point for illustrating performance. All this started with the BTL (Benchmark Template Library).
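For reference, the MFLOPS figure in such GEMM benchmarks is usually just the flop count 2*m*n*k divided by the runtime:

#include <cstddef>

// C += A*B with A (m x k) and B (k x n) performs about 2*m*n*k flops.
double mflops(std::size_t m, std::size_t n, std::size_t k, double seconds)
{
    return 2.0 * double(m) * double(n) * double(k) / (seconds * 1e6);
}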
But I will look into the BLAZE code to make sure (as in proof).


Hi Michael,
as for the block size, you could choose it somewhat better.
Both MR and NR should be at least the size of an L1 cache line (i.e., 16 for double and 32 for float on Intel); moreover, you should try to align your blocks to an L1 cache line boundary (e.g. using posix_memalign(), or the cache-aligned allocator from TBB).
This alone brought something like a 50% increase in performance in my tests. It also means that the original implementation stops being competitive at around 50*50 squares, as opposed to 75*75.
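A minimal sketch of such cache-line-aligned allocation for the packing buffers (assuming a 64-byte cache line and POSIX; the function name is just for illustration):

#include <cstdlib>
#include <cstddef>
#include <new>

// Allocate n doubles on a 64-byte (cache line) boundary.
double *alloc_cacheline_aligned(std::size_t n)
{
    void *p = nullptr;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        throw std::bad_alloc();
    return static_cast<double *>(p);   // release with std::free()
}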
I have some ideas about the other parameters too, and I plan to have a look with perf during the weekend.
Dynamic vs. static block size: a wild guess would be that dynamic sizing probably becomes worthwhile above a certain size. Should measure, though.
Cheers,
Imre


>> About Blaze: Do they have their own implementation of a matrix-matrix product? It seems
>> to require a tuned BLAS implementation ("Otherwise you get only poor performance") for
>> the matrix-matrix product.
>
> I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking
> it a step further (I am not sure how) and they are getting better performance than the
> underlying GEMM. Their benchmarks indicate that they are faster than MKL
> (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrix-matrix-multiplication)
I started today with similar experiments on BLAZE and had a closer look at their internal implementation. By default they are calling an external BLAS backend. On my machine I used the Intel MKL. But you are right, they also have an internal implementation that can be used if no external BLAS is available. I will publish the results on this page:
http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html
At the moment the benchmarks for the internal BLAZE implementation for the matrix-matrix product seem to look poor. I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I have used. So don't take the current results as-is.
Cheers,
Michael


On 16 Feb 2016, at 17:09, Michael Lehn < [hidden email]> wrote:
> I will publish the results on this page:
>
> http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html
>
> At the moment the benchmarks for the internal BLAZE implementation for the matrix-matrix
> product seem to look poor. I asked Klaus Iglberger (the author of BLAZE) to check the
> compiler flags that I have used. So don't take the current results as-is.
I modified the benchmark code in favour of BLAZE as Klaus Iglberger suggested. The results are now slightly better. However, compared to Intel MKL they are way off. I got similar results on (my) Haswell. The problem seems to be as stated by Klaus Iglberger: "The optimization flags are fine. The performance difference is due to the lack of a specifically optimized kernel." So on other platforms it might perform much better.

Dr. Michael Lehn
University of Ulm, Institute for Numerical Mathematics
Helmholtzstr. 20
D-89069 Ulm, Germany
Phone: (+49) 731 5023534, Fax: (+49) 731 5023548


