Re: ublas Digest, Vol 130, Issue 8



palik imre

vectorisation: -Ofast vectorises even better (though the results might be off somewhat, since -Ofast enables -ffast-math and relaxes IEEE floating-point semantics).
BTW, this is one of the big issues with the vanilla ublas implementation: it doesn't vectorise.
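
As a rough sketch of why -Ofast changes results, consider a reduction loop like the following (hypothetical code, not from ublas; the flag behaviour described in the comments is standard GCC):

// Hypothetical reduction loop illustrating the -O3 vs -Ofast trade-off.
//
//   g++ -O3    -march=native dot.cpp   // keeps IEEE semantics; the
//                                      // reduction typically stays scalar
//   g++ -Ofast -march=native dot.cpp   // adds -ffast-math, which permits
//                                      // reassociation, so the sum may be
//                                      // split across SIMD lanes: faster,
//                                      // but rounded slightly differently
#include <cstddef>

double dot(const double* x, const double* y, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}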

On Friday, 22 January 2016, 0:30, "[hidden email]" <[hidden email]> wrote:


From: nasos <[hidden email]>
To: [hidden email]
Subject: Re: [ublas] Matrix multiplication performance
Message-ID: <[hidden email]>
Content-Type: text/plain; charset="windows-1252"; Format="flowed"

Michael,
please see below

On 01/21/2016 05:23 PM, Michael Lehn wrote:

> Hi Nasos,
>
> first of all I don't want to take credit that isn't mine and want to
> point out that this is not my algorithm.  It is based on
>
> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>
> https://github.com/flame/blis
>
> For a few cores (4-8) it can easily be made multithreaded.  For
> many-cores like the Intel Xeon Phi this is a bit more
> sophisticated, but still not too hard.
Setting up Phis is indeed an issue, especially because they are "locked"
to icpc. OpenMP is working properly, though.
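
For what it's worth, the OpenMP parallelisation meant here can be as simple as the sketch below (function name, blocking scheme and block size are hypothetical, not the posted demo). Parallelising the outermost block loop is usually enough for 4-8 cores, since each thread then owns disjoint rows of C and no synchronisation is needed:

#include <cstddef>
#include <omp.h>  // compile with -fopenmp (gcc) or -qopenmp (icpc)

// Hypothetical blocked matrix product C += A*B for row-major N x N
// matrices, with block size BS.
void gemm_blocked(const double* A, const double* B, double* C,
                  std::size_t N, std::size_t BS = 64)
{
    #pragma omp parallel for schedule(static)
    for (std::size_t ib = 0; ib < N; ib += BS)            // rows of C
        for (std::size_t kb = 0; kb < N; kb += BS)        // inner dimension
            for (std::size_t jb = 0; jb < N; jb += BS)    // columns of C
                for (std::size_t i = ib; i < ib + BS && i < N; ++i)
                    for (std::size_t k = kb; k < kb + BS && k < N; ++k) {
                        double a = A[i*N + k];
                        for (std::size_t j = jb; j < jb + BS && j < N; ++j)
                            C[i*N + j] += a * B[k*N + j];
                    }
}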

> The demo I posted does not use micro-kernels that exploit SSE, AVX or
> FMA instructions.  With those, the matrix product is on par with Intel
> MKL, just like BLIS.  For my platforms I wrote
> my own micro-kernels, but the interface of the function ugemm is compatible
> with BLIS.
>
If you compile with -O3 I think you are getting near-optimal SSE
vectorization. gcc is truly impressive, and intel even more so.
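
For contrast with auto-vectorisation, here is a rough sketch of what a hand-written FMA micro-kernel can look like (AVX2/FMA intrinsics; the name ugemm_4x4 and the packing layout are illustrative only, loosely following the BLIS scheme, and are not the actual ugemm interface from the post):

#include <immintrin.h>  // AVX2/FMA intrinsics; compile with -mavx2 -mfma
#include <cstddef>

// Hypothetical 4x4 micro-kernel: C(4x4) += A(4xk) * B(kx4), where A is
// packed column-major, B is packed row-major, and rows of C are ldC
// doubles apart (ldC >= 4 assumed).
void ugemm_4x4(std::size_t k, const double* A, const double* B,
               double* C, std::size_t ldC)
{
    __m256d c0 = _mm256_loadu_pd(C + 0*ldC);
    __m256d c1 = _mm256_loadu_pd(C + 1*ldC);
    __m256d c2 = _mm256_loadu_pd(C + 2*ldC);
    __m256d c3 = _mm256_loadu_pd(C + 3*ldC);
    for (std::size_t l = 0; l < k; ++l) {
        __m256d b = _mm256_loadu_pd(B + 4*l);   // one row of packed B
        // Broadcast each element of A's column and accumulate with FMA.
        c0 = _mm256_fmadd_pd(_mm256_broadcast_sd(A + 4*l + 0), b, c0);
        c1 = _mm256_fmadd_pd(_mm256_broadcast_sd(A + 4*l + 1), b, c1);
        c2 = _mm256_fmadd_pd(_mm256_broadcast_sd(A + 4*l + 2), b, c2);
        c3 = _mm256_fmadd_pd(_mm256_broadcast_sd(A + 4*l + 3), b, c3);
    }
    _mm256_storeu_pd(C + 0*ldC, c0);
    _mm256_storeu_pd(C + 1*ldC, c1);
    _mm256_storeu_pd(C + 2*ldC, c2);
    _mm256_storeu_pd(C + 3*ldC, c3);
}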



_______________________________________________
ublas mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: [hidden email]