
Matrix multiplication performance

Matrix multiplication performance

palik imre
Hi Michael,

your block sizes are far from optimal.  MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on Intel).  Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).

This alone brought something like 50% speedup for my square matrix test.
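For concreteness, a minimal sketch of such an aligned allocation (a hypothetical helper, not taken from the posted demo; posix_memalign() is POSIX and declared in <stdlib.h>):

    #include <cstdlib>   // posix_memalign() (POSIX; from <stdlib.h>)
    #include <cstddef>

    // Allocate n doubles aligned to a 64-byte cache line boundary.
    double* alloc_aligned(std::size_t n)
    {
        void* p = nullptr;
        if (posix_memalign(&p, 64, n*sizeof(double)) != 0) {
            return nullptr;                  // allocation failed
        }
        return static_cast<double*>(p);      // release with free()
    }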

I will have a look at the other parameters + the whole thing via perf during the weekend.

Cheers,

Imre



On Friday, 22 January 2016, 0:28, "[hidden email]" <[hidden email]> wrote:



Hi Nasos,

first of all I don't want to take credit that isn't mine and want to point out that this is not my algorithm.  It is based on

    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf

    https://github.com/flame/blis

For a few cores (4-8) it can easily be made multithreaded.  For many-core processors like the Intel Xeon Phi this is a bit more
sophisticated, but still not too hard.  The demo I posted does not use micro-kernels that exploit SSE, AVX or
FMA instructions.  With such kernels the matrix product is on par with Intel MKL, just like BLIS.  For my platforms I wrote
my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
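To give an idea, here is a hedged sketch of the kind of update an AVX micro-kernel performs for one k-step
on an MR x NR = 4 x 8 block of doubles (purely illustrative; real kernels keep the accumulators in registers
across the whole k-loop; compile with -mavx):

    #include <immintrin.h>

    // One rank-1 update step: ab[j] += A(0..3,l) * B(l,j) for j = 0..7.
    // a points at 4 packed entries of A, b at 8 packed entries of B.
    void rank1_update_4x8(const double *a, const double *b, __m256d ab[8])
    {
        __m256d ai = _mm256_loadu_pd(a);              // A(0..3,l)
        for (int j=0; j<8; ++j) {
            __m256d bj = _mm256_broadcast_sd(b+j);    // broadcast B(l,j)
            ab[j] = _mm256_add_pd(ab[j], _mm256_mul_pd(ai, bj));
        }
    }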

Maybe you could help me to integrate your code in the benchmark example I posted above.

About Blaze:  Do they have their own implementation of a matrix-matrix product?  It seems to require a
tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.
IMHO they only have tuned the "easy" stuff like BLAS Level 1 and Level 2.  In that case it makes more
sense to compare the performance with the actual underlying GEMM implementation.  But if I am wrong,
let me know.

About the block sizes: In my experience you get better performance if you choose them dynamically at runtime
depending on the problem size.  Depending on the architecture you can just specify ranges like 256 - 384 for the
blocking factor MC.  In my code it then also needs to satisfy the restriction that it is divisible by the factor MR.
I know that doing things at compile time is a C++ fetish.  But the runtime overhead is negligible and having
blocks of similar sizes easily pays off.
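For illustration, a minimal sketch of such a runtime choice (a hypothetical helper; the range and the
divisibility-by-MR restriction follow the description above):

    #include <cstddef>

    // Split the m rows into similar-sized panels of at most MC_max
    // rows; each panel size is rounded up to a multiple of MR.
    std::size_t choose_MC(std::size_t m, std::size_t MR)
    {
        const std::size_t MC_max = 384;             // upper end of the range
        if (m <= MC_max) {
            return ((m + MR - 1)/MR)*MR;            // a single panel
        }
        std::size_t panels = (m + MC_max - 1)/MC_max;
        std::size_t mc     = (m + panels - 1)/panels;
        return ((mc + MR - 1)/MR)*MR;               // divisible by MR
    }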

Cheers,

Michael
*************************************



Re: Matrix multiplication performance

palik imre
Sorry for posting twice more or less the same thing.  I got confused with javascript interfaces.

It seems I also forgot to enable AVX for my last measurements.  With that plus my blocking and alignment changes, performance according to my tests is something like 250% higher than with Michael's original code (with AVX).

Cheers,

Imre


Re: Matrix multiplication performance

Nasos Iliopoulos
Palik,
what was the full compile line you used? g++ -Ofast -mavx ...?

-Nasos


Re: Matrix multiplication performance

Michael Lehn-2
In reply to this post by palik imre
Hi Imre,

thanks for running the benchmarks.  Of course you are right that using aligned memory for the buffers improves
performance.  I also did not really put any effort into optimising the parameters MC, NC, KC, MR and NR.  I will
compare different variants and report them on the website

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html

I modified my benchmark program such that it also computes the FLOPS as

    FLOPS = 2*m*n*k/time_elapsed

(each of the m*n entries of the result takes k multiplications and k additions).  See

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc

Could you re-run your benchmarks and post the different MFLOPS you get? That is important for actually tuning things.
On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS).   So
a speedup of 2.5 would be impressive but still far from peak performance.
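A minimal sketch of such a measurement, assuming some callable product under test (the name is illustrative):

    #include <chrono>
    #include <cstddef>

    // Time one matrix product and convert to MFLOPS via 2*m*n*k/t.
    template <typename GEMM>
    double mflops(GEMM&& gemm, std::size_t m, std::size_t n, std::size_t k)
    {
        auto t0 = std::chrono::steady_clock::now();
        gemm();                                     // e.g. C = A*B
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        return 2.0*m*n*k / (1e6*dt.count());
    }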

Cheers,

Michael


Re: Matrix multiplication performance

Michael Lehn-2
Wow Imre!

Ok, that is actually a significant difference :-)

I have just added a new version to my site.  Unfortunately the computer I used for creating
the page does not have posix_memalign().  So I had to hack my own function for that.  But
I think for a proof of concept it will do.

But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA.  The
micro-kernel for AVX requires MR=4, NR=8.  For FMA it requires MR=4, NR=16.  Otherwise
the reference implementation gets selected.  For all parameters the default values can now be
overridden when compiling, e.g.

        g++ -O3 -Wall -std=c++11 -DHAVE_AVX  -DBS_D_MC=512  matprod.cc

The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA.
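A plausible sketch of how such overridable defaults can be wired up (the macro name BS_D_MC follows
the compile line above; the rest is illustrative and may differ from the actual gemm.hpp):

    #include <cstddef>

    // Compile-time default for the blocking factor MC,
    // overridable via -DBS_D_MC=... on the compile line.
    #ifndef BS_D_MC
    #define BS_D_MC 256
    #endif

    constexpr std::size_t MC_default = BS_D_MC;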

I put all this stuff here

        http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html

also the tar-ball

        http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz

contains all required files:

        session2/avx.hpp
        session2/fma.hpp
        session2/gemm.hpp
        session2/matprod.cc

Of course I would be interested in how all this performs on other platforms.


Cheers,

Michael


On 22 Jan 2016, at 16:20, palik imre <[hidden email]> wrote:

> Hi All,
>
> In the meantime I enabled avx2 ...
>
> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6GFLOPS
> So it is at ~25%
>
> Compiler is gcc 4.8.3
>
> vanilla run:
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
> $ ./matprod
> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>   100   100   100  0.000350402      5707.73      0.00116104      1722.59               0
>   200   200   200   0.00445094      3594.74      0.00819996      1951.23               0
>   300   300   300    0.0138515      3898.49       0.0266515      2026.15     1.06987e-06
>   400   400   400    0.0266447      4803.96       0.0613506      2086.37     4.01475e-06
>   500   500   500    0.0424372      5891.06        0.119345      2094.77     8.99605e-06
>   600   600   600    0.0724648      5961.51        0.203187      2126.12     1.63618e-05
>   700   700   700     0.115464      5941.25        0.325834      2105.36     2.69547e-05
>   800   800   800     0.173003      5918.98        0.480655      2130.43     4.09449e-05
>   900   900   900     0.248077       5877.2        0.689972      2113.13     5.87376e-05
>  1000  1000  1000      0.33781      5920.49        0.930591      2149.17     8.16264e-05
>  1100  1100  1100       0.5149      5169.93         1.25507         2121     0.000108883
>  1200  1200  1200     0.670628      5153.38         1.62732      2123.74     0.000141876
>  1300  1300  1300     0.852692      5153.09         2.06708      2125.71     0.000180926
>  1400  1400  1400      1.06695      5143.65         2.56183      2142.22     0.000225975
>  1500  1500  1500       1.3874       4865.2         3.16532      2132.49     0.000278553
>  1600  1600  1600      1.77623      4612.03          3.8137      2148.05     0.000338106
>  1700  1700  1700       2.3773      4133.26         4.56665      2151.69     0.000404458
>  1800  1800  1800      3.06381      3807.03         5.40317      2158.73      0.00048119
>  1900  1900  1900       3.9039      3513.92         6.37295      2152.53     0.000564692
>  2000  2000  2000      4.79166      3339.13         7.43399      2152.28     0.000659714
>  2100  2100  2100      6.04946      3061.76         8.62429      2147.65     0.000762223
>  2200  2200  2200      7.39085       2881.4         9.86237      2159.32     0.000875624
>  2300  2300  2300      9.00453      2702.42         11.2513      2162.78      0.00100184
>  2400  2400  2400      10.3952      2659.68         12.7491      2168.62      0.00113563
>  2500  2500  2500      12.2283      2555.55         14.4615      2160.92      0.00128336
>  2600  2600  2600      13.8912      2530.51         16.1965      2170.34      0.00144304
>  2700  2700  2700       15.391      2557.72         18.1998      2162.99      0.00161411
>  2800  2800  2800      17.5673      2499.19         20.2171      2171.63      0.00180035
>  2900  2900  2900      19.4621      2506.31         22.5482      2163.28      0.00199765
>  3000  3000  3000      21.4506      2517.42         24.9477      2164.53      0.00221028
>  3100  3100  3100        23.71      2512.95         27.5144      2165.48      0.00243877
>  3200  3200  3200      25.9051      2529.85         30.2816      2164.22      0.00267766
>  3300  3300  3300      28.1949      2549.18          33.176      2166.45      0.00293379
>  3400  3400  3400      30.7235      2558.56         36.0156      2182.61       0.0032087
>  3500  3500  3500      34.0419      2518.95         39.3929      2176.79      0.00349827
>  3600  3600  3600      37.0562      2518.12         42.7524      2182.62      0.00380447
>  3700  3700  3700      39.7885      2546.11         46.4748      2179.81      0.00412621
>  3800  3800  3800      43.6607      2513.56         50.2119      2185.62       0.0044694
>  3900  3900  3900      46.5104      2550.78         54.4822      2177.56      0.00482355
>  4000  4000  4000      50.6098      2529.15         58.7686      2178.03      0.00520289
>
> tuned run:
>
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
> $ ./matprod2
> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>   100   100   100  0.000351671      5687.13     0.000316612      6316.88               0
>   200   200   200   0.00419531      3813.78      0.00159044      10060.1               0
>   300   300   300    0.0141153      3825.62      0.00421113      12823.2     1.07645e-06
>   400   400   400    0.0291599      4389.59      0.00858138        14916     4.00614e-06
>   500   500   500    0.0483492      5170.72       0.0166519      15013.3     8.96808e-06
>   600   600   600    0.0725783      5952.19       0.0279634      15448.7     1.63386e-05
>   700   700   700     0.113891      6023.29        0.043077        15925     2.69191e-05
>   800   800   800     0.171416      5973.79       0.0627796        16311     4.09782e-05
>   900   900   900     0.243677      5983.32       0.0922766      15800.3     5.88092e-05
>  1000  1000  1000     0.335158      5967.33        0.123339      16215.5     8.15988e-05
>  1100  1100  1100     0.515776      5161.15         0.16578      16057.5     0.000108991
>  1200  1200  1200     0.662706      5214.98        0.205989      16777.6     0.000141824
>  1300  1300  1300     0.845952      5194.15         0.27637        15899      0.00018111
>  1400  1400  1400      1.06712      5142.82        0.332118      16524.2     0.000225958
>  1500  1500  1500      1.38147      4886.11        0.409224      16494.6     0.000278265
>  1600  1600  1600      1.72238      4756.21        0.492314      16639.8     0.000338095
>  1700  1700  1700      2.38508      4119.77        0.603566      16279.9     0.000404362
>  1800  1800  1800      3.12034      3738.05        0.717409      16258.5     0.000481575
>  1900  1900  1900      3.93668      3484.66        0.824933      16629.2     0.000564727
>  2000  2000  2000      4.76038      3361.07        0.941643      16991.6      0.00065862
>  2100  2100  2100      5.90627      3135.99         1.12226      16504.2     0.000762307
>  2200  2200  2200      7.26419      2931.64         1.28213      16609.9     0.000876699
>  2300  2300  2300      8.88171      2739.79         1.45247      16753.5      0.00100222
>  2400  2400  2400      10.4956      2634.26         1.62705      16992.7      0.00113566
>  2500  2500  2500       11.913      2623.18         1.87499      16666.7      0.00128371
>  2600  2600  2600      13.7057      2564.77          2.1156      16615.6      0.00144259
>  2700  2700  2700      15.5959      2524.13         2.33957      16826.1      0.00161501
>  2800  2800  2800      17.1121      2565.67         2.57445      17053.8      0.00179901
>  2900  2900  2900      19.4167      2512.16         2.92445      16679.4      0.00199764
>  3000  3000  3000      21.3239      2532.37         3.18891      16933.7      0.00220999
>  3100  3100  3100      23.5049      2534.88          3.5305      16876.4      0.00243845
>  3200  3200  3200      25.7362      2546.45         3.81708      17169.1      0.00267581
>  3300  3300  3300      28.4467      2526.62         4.25869        16877      0.00293513
>  3400  3400  3400      30.4607      2580.63         4.67999      16796.6      0.00320688
>  3500  3500  3500      33.7737      2538.96         5.04289      17004.1      0.00349667
>  3600  3600  3600      36.9633      2524.45           5.414      17235.3      0.00380237
>  3700  3700  3700      39.5153      2563.71         6.04875      16748.2      0.00412583
>  3800  3800  3800      42.9412      2555.68         6.48985      16910.1      0.00446785
>  3900  3900  3900      46.5282      2549.81         7.05844        16808      0.00482701
>  4000  4000  4000      50.2218      2548.69         7.42442      17240.4      0.00520272
>
> As the generated machine code is completely different, I guess gcc notices the aligned alloc, and uses the alignment information for optimisation.
>
> Cheers,
>
> Imre
Re: Matrix multiplication performance

Michael Lehn-2
I re-ran all the benchmarks with “-Ofast -mavx” (as the hardware on which I generate the
benchmarks does not have AVX2).  So now the original ublas::axpy_prod() is actually
doing way better ...


Re: Matrix multiplication performance

palik imre
In reply to this post by palik imre
Hi Michael,

I've yet to check the machine code, but as I understand it, your gemm approach is optimised for fmadd.

On my old AMD box, setting MR to 4, NR to 16, and flipping the loop on l with the loop on j in ugemm (using cache-aligned blocks) caused a more than two-fold speedup compared to the vanilla MR = 16, NR = 16 case (a sketch of the interchange follows the numbers below):


$ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod2 matprod2.cc
$ ./matprod2
#   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
  100   100   100    0.0581846      34.3734      0.00188958      1058.44               0
  200   200   200     0.414865      38.5667       0.0123334      1297.29               0
  300   300   300      1.37691      39.2183       0.0387481      1393.62      1.0845e-06
  400   400   400      3.26061      39.2565       0.0883939      1448.06     4.04763e-06
  500   500   500      6.36875      39.2542          0.1813      1378.93     9.09805e-06
  600   600   600      10.8398       39.853        0.309016      1397.98     1.65132e-05
  700   700   700      17.2165      39.8454        0.484826      1414.94     2.72393e-05
  800   800   800      25.6953      39.8516        0.712666      1436.86      4.1317e-05
  900   900   900       36.655      39.7763         1.03119       1413.9     5.92538e-05
 1000  1000  1000      50.6028      39.5235         1.39929       1429.3     8.21395e-05
$ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod3 matprod3.cc
$ ./matprod3
#   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
  100   100   100    0.0528992      37.8078     0.000628359      3182.89               0
  200   200   200     0.414929      38.5609      0.00468125      3417.89               0
  300   300   300      1.37729      39.2073        0.014938      3614.94     1.08932e-06
  400   400   400      3.26166      39.2438       0.0340622      3757.83     4.06563e-06
  500   500   500      6.37056       39.243       0.0670238      3730.02      9.1488e-06
  600   600   600      10.8375      39.8615        0.112301       3846.8     1.65072e-05
  700   700   700      17.2216      39.8336        0.181384      3782.02     2.71309e-05
  800   800   800      25.7673      39.7402        0.280321      3652.95     4.14585e-05
  900   900   900      36.6848      39.7439        0.386776      3769.63     5.91959e-05
 1000  1000  1000      50.5229       39.586        0.525621      3805.03     8.21298e-05
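For reference, the interchange described above amounts to something like the following on a plain-C++
micro-kernel (a sketch only; the BLIS-style signature and packed-panel layout are assumptions and may
differ from the posted gemm.hpp):

    #include <cstddef>

    template <typename T>
    void ugemm(std::size_t kc, T alpha, const T *A, const T *B,
               T beta, T *C, std::size_t incRowC, std::size_t incColC)
    {
        const std::size_t MR = 4, NR = 16;
        T AB[MR*NR] = {};                        // MRxNR accumulator

        // loop on j hoisted outside the loop on l (the interchange)
        for (std::size_t j=0; j<NR; ++j) {
            for (std::size_t l=0; l<kc; ++l) {
                for (std::size_t i=0; i<MR; ++i) {
                    AB[i+j*MR] += A[i+l*MR]*B[j+l*NR];
                }
            }
        }
        // scale C and add the accumulated product
        for (std::size_t j=0; j<NR; ++j) {
            for (std::size_t i=0; i<MR; ++i) {
                C[i*incRowC+j*incColC] =
                    beta*C[i*incRowC+j*incColC] + alpha*AB[i+j*MR];
            }
        }
    }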




Re: Matrix multiplication performance

Michael Lehn-2

> I'm yet to check the machine code, but according to my understanding, your gemm approach is optimised for fmadd.

Only the micro-kernel enabled with -DHAVE_FMA uses fused-multiply-add instructions.  It should work if
cat /proc/cpuinfo shows “fma”.  Otherwise the micro-kernel with -DHAVE_AVX should work.
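The same check can be done in code instead of reading /proc/cpuinfo.  A sketch using GCC's __builtin_cpu_supports() (available since GCC 4.8); the mapping to the -DHAVE_* flags follows the description above:

    #include <cstdio>

    int main()
    {
        __builtin_cpu_init();   // required once before __builtin_cpu_supports()
        if (__builtin_cpu_supports("fma"))
            std::puts("fma available: the -DHAVE_FMA micro-kernel should work");
        else if (__builtin_cpu_supports("avx"))
            std::puts("avx available: the -DHAVE_AVX micro-kernel should work");
        else
            std::puts("neither: the reference micro-kernel gets selected");
        return 0;
    }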


>
> On my old AMD box, setting MR to 4, NR to 16, and flipping the loop on l with the loop on j in ugemm (using cache-aligned blocks) caused a more than 2-times speedup compared to the vanilla MR = 16, NR = 16 case:

Ok, that can be.  I have never really gone into optimising the plain-C++ micro-kernel.  I just used it as
a reference implementation.  But I guess it is not easy to find generally good constants.  In a real-world
library it might be good to have some config-benchmarks that set these empirically.  Similar to ATLAS.
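To make this concrete, a plain-C++ micro-kernel in the BLIS style might look like the sketch below.  The packed-panel layout and the signature are assumptions modelled on this thread, not the actual ugemm from gemm.hpp; swapping the l and j loops, as described above, changes the traversal without changing the result:

    #include <cstddef>

    // Sketch: C := beta*C + alpha*A*B for one MR x NR block, where A is an
    // MR x kc packed panel (column l stored contiguously) and B is a
    // kc x NR packed panel (row l stored contiguously).
    template <typename T, std::size_t MR, std::size_t NR>
    void ugemm_ref(std::size_t kc, T alpha, const T *A, const T *B,
                   T beta, T *C, std::size_t incRowC, std::size_t incColC)
    {
        T AB[MR * NR] = {};  // accumulator block, small enough for registers

        for (std::size_t l = 0; l < kc; ++l)        // rank-1 update per k-step
            for (std::size_t j = 0; j < NR; ++j)
                for (std::size_t i = 0; i < MR; ++i)
                    AB[i * NR + j] += A[l * MR + i] * B[l * NR + j];

        for (std::size_t j = 0; j < NR; ++j)        // scaled write-back to C
            for (std::size_t i = 0; i < MR; ++i)
                C[i * incRowC + j * incColC] =
                    beta * C[i * incRowC + j * incColC] + alpha * AB[i * NR + j];
    }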

Also, the optimality of the assembly micro-kernels is tricky.  E.g. I have two older machines that only have
SSE (an Intel Core Duo and a Xeon).  Looking only at the SSE3 entry in /proc/cpuinfo would suggest that
the same micro-kernel should be used.  But they actually require different asm micro-kernels that realise
different algorithms.  One works with register broadcasts, the other with register permutations.  Choosing
the “wrong” micro-kernel gives terrible results.  Choosing the “right” micro-kernel gets close to MKL.
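For illustration, one k-step of the register-broadcast scheme for doubles with AVX (MR=4, NR=8, matching the posted kernels) could look like this sketch; the real kernels live in avx.hpp/fma.hpp and are written differently:

    #include <immintrin.h>

    // One k-step of a broadcast-style 4x8 double micro-kernel (MR=4, NR=8).
    // ab[0..7] are accumulators held in YMM registers across the k-loop;
    // a points to 4 rows of the packed A panel (32-byte aligned).
    static inline void broadcast_step(const double *a, const double *b,
                                      __m256d ab[8])
    {
        __m256d a0 = _mm256_load_pd(a);
        for (int j = 0; j < 8; ++j) {
            __m256d bj = _mm256_broadcast_sd(b + j); // broadcast B(l, j)
            ab[j] = _mm256_add_pd(ab[j], _mm256_mul_pd(a0, bj));
            // with FMA this collapses to: ab[j] = _mm256_fmadd_pd(a0, bj, ab[j]);
        }
    }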


>
>
> $ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod2 matprod2.cc
> $ ./matprod2
> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>  100   100   100    0.0581846      34.3734      0.00188958      1058.44               0
>  200   200   200     0.414865      38.5667       0.0123334      1297.29               0
>  300   300   300      1.37691      39.2183       0.0387481      1393.62      1.0845e-06
>  400   400   400      3.26061      39.2565       0.0883939      1448.06     4.04763e-06
>  500   500   500      6.36875      39.2542          0.1813      1378.93     9.09805e-06
>  600   600   600      10.8398       39.853        0.309016      1397.98     1.65132e-05
>  700   700   700      17.2165      39.8454        0.484826      1414.94     2.72393e-05
>  800   800   800      25.6953      39.8516        0.712666      1436.86      4.1317e-05
>  900   900   900       36.655      39.7763         1.03119       1413.9     5.92538e-05
> 1000  1000  1000      50.6028      39.5235         1.39929       1429.3     8.21395e-05
> $ g++ -Wall -W -std=gnu++11 -Ofast -mavx -o matprod3 matprod3.cc
> $ ./matprod3
> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>  100   100   100    0.0528992      37.8078     0.000628359      3182.89               0
>  200   200   200     0.414929      38.5609      0.00468125      3417.89               0
>  300   300   300      1.37729      39.2073        0.014938      3614.94     1.08932e-06
>  400   400   400      3.26166      39.2438       0.0340622      3757.83     4.06563e-06
>  500   500   500      6.37056       39.243       0.0670238      3730.02      9.1488e-06
>  600   600   600      10.8375      39.8615        0.112301       3846.8     1.65072e-05
>  700   700   700      17.2216      39.8336        0.181384      3782.02     2.71309e-05
>  800   800   800      25.7673      39.7402        0.280321      3652.95     4.14585e-05
>  900   900   900      36.6848      39.7439        0.386776      3769.63     5.91959e-05
> 1000  1000  1000      50.5229       39.586        0.525621      3805.03     8.21298e-05
>
>
>
> --------------------------------------------
> On Sat, 23/1/16, Michael Lehn <[hidden email]> wrote:
>
> Subject: Re: [ublas] Matrix multiplication performance
> To: "ublas mailing list" <[hidden email]>
> Cc: "palik imre" <[hidden email]>
> Date: Saturday, 23 January, 2016, 1:41
>
> I re-ran all the benchmarks with “-Ofast -mavx” (as the hardware on which I generate the benchmarks does not have AVX2).  So now the original ublas::axpy_prod() is actually doing way better ...
>
>
> On 22 Jan 2016, at 21:27, Michael Lehn <[hidden email]> wrote:
>
>> Wow Imre!
>>
>> Ok, that is actually a significant difference :-)
>>
>> I have just added a new version to my site.  Unfortunately the computer I used for creating
>> the page does not have posix_memalign().  So I had to hack my own function for that.  But
>> I think for a proof of concept it will do.
>>
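Such a hand-rolled replacement usually over-allocates and rounds the pointer up.  A sketch of that trick (not the actual function from the site):

    #include <cstdint>
    #include <cstdlib>

    // Fallback for platforms without posix_memalign(): over-allocate, round
    // the pointer up to the requested (power-of-two) alignment, and stash
    // the original pointer just below the aligned block for aligned_free().
    void *aligned_malloc(std::size_t alignment, std::size_t size)
    {
        void *raw = std::malloc(size + alignment + sizeof(void *));
        if (!raw) return nullptr;
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
        p = (p + alignment - 1) & ~std::uintptr_t(alignment - 1);
        reinterpret_cast<void **>(p)[-1] = raw;
        return reinterpret_cast<void *>(p);
    }

    void aligned_free(void *p)
    {
        if (p) std::free(reinterpret_cast<void **>(p)[-1]);
    }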
>> But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA.  The
>> micro-kernel for AVX requires MR=4, NR=8.  For FMA it requires MR=4, NR=16.  Otherwise
>> the reference implementation gets selected.  Default values for all parameters can now be
>> overwritten when compiling, e.g.
>>
>>     g++ -O3 -Wall -std=c++11 -DHAVE_AVX  -DBS_D_MC=512  matprod.cc
>>
>> The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA.
>>
>> I put all this stuff here
>>
>>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
>>
>> also the tar-ball
>>
>>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
>>
>> contains all required files:
>>
>>     session2/avx.hpp
>>     session2/fma.hpp
>>     session2/gemm.hpp
>>     session2/matprod.cc
>>
>> Of course I would be interested in how all this performs on other platforms.
>>
>> Cheers,
>>
>> Michael
>>
>>
>> On 22 Jan 2016, at 16:20, palik imre <[hidden email]> wrote:
>>
>>> Hi All,
>>>
>>> In the meantime I enabled avx2 ...
>>>
>>> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
>>> So the tuned run below is at ~25% of that.
>>>
>>> Compiler is gcc 4.8.3
>>>
>>> vanilla run:
>>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
>>> $ ./matprod
>>> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>>>  100   100   100  0.000350402      5707.73      0.00116104      1722.59               0
>>>  200   200   200   0.00445094      3594.74      0.00819996      1951.23               0
>>>  300   300   300    0.0138515      3898.49       0.0266515      2026.15     1.06987e-06
>>>  400   400   400    0.0266447      4803.96       0.0613506      2086.37     4.01475e-06
>>>  500   500   500    0.0424372      5891.06        0.119345      2094.77     8.99605e-06
>>>  600   600   600    0.0724648      5961.51        0.203187      2126.12     1.63618e-05
>>>  700   700   700     0.115464      5941.25        0.325834      2105.36     2.69547e-05
>>>  800   800   800     0.173003      5918.98        0.480655      2130.43     4.09449e-05
>>>  900   900   900     0.248077       5877.2        0.689972      2113.13     5.87376e-05
>>> 1000  1000  1000      0.33781      5920.49        0.930591      2149.17     8.16264e-05
>>> 1100  1100  1100       0.5149      5169.93         1.25507         2121     0.000108883
>>> 1200  1200  1200     0.670628      5153.38         1.62732      2123.74     0.000141876
>>> 1300  1300  1300     0.852692      5153.09         2.06708      2125.71     0.000180926
>>> 1400  1400  1400      1.06695      5143.65         2.56183      2142.22     0.000225975
>>> 1500  1500  1500       1.3874       4865.2         3.16532      2132.49     0.000278553
>>> 1600  1600  1600      1.77623      4612.03          3.8137      2148.05     0.000338106
>>> 1700  1700  1700       2.3773      4133.26         4.56665      2151.69     0.000404458
>>> 1800  1800  1800      3.06381      3807.03         5.40317      2158.73      0.00048119
>>> 1900  1900  1900       3.9039      3513.92         6.37295      2152.53     0.000564692
>>> 2000  2000  2000      4.79166      3339.13         7.43399      2152.28     0.000659714
>>> 2100  2100  2100      6.04946      3061.76         8.62429      2147.65     0.000762223
>>> 2200  2200  2200      7.39085       2881.4         9.86237      2159.32     0.000875624
>>> 2300  2300  2300      9.00453      2702.42         11.2513      2162.78      0.00100184
>>> 2400  2400  2400      10.3952      2659.68         12.7491      2168.62      0.00113563
>>> 2500  2500  2500      12.2283      2555.55         14.4615      2160.92      0.00128336
>>> 2600  2600  2600      13.8912      2530.51         16.1965      2170.34      0.00144304
>>> 2700  2700  2700       15.391      2557.72         18.1998      2162.99      0.00161411
>>> 2800  2800  2800      17.5673      2499.19         20.2171      2171.63      0.00180035
>>> 2900  2900  2900      19.4621      2506.31         22.5482      2163.28      0.00199765
>>> 3000  3000  3000      21.4506      2517.42         24.9477      2164.53      0.00221028
>>> 3100  3100  3100        23.71      2512.95         27.5144      2165.48      0.00243877
>>> 3200  3200  3200      25.9051      2529.85         30.2816      2164.22      0.00267766
>>> 3300  3300  3300      28.1949      2549.18          33.176      2166.45      0.00293379
>>> 3400  3400  3400      30.7235      2558.56         36.0156      2182.61       0.0032087
>>> 3500  3500  3500      34.0419      2518.95         39.3929      2176.79      0.00349827
>>> 3600  3600  3600      37.0562      2518.12         42.7524      2182.62      0.00380447
>>> 3700  3700  3700      39.7885      2546.11         46.4748      2179.81      0.00412621
>>> 3800  3800  3800      43.6607      2513.56         50.2119      2185.62       0.0044694
>>> 3900  3900  3900      46.5104      2550.78         54.4822      2177.56      0.00482355
>>> 4000  4000  4000      50.6098      2529.15         58.7686      2178.03      0.00520289
>>>
>>> tuned run:
>>>
>>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
>>> $ ./matprod2
>>> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>>>  100   100   100  0.000351671      5687.13     0.000316612      6316.88               0
>>>  200   200   200   0.00419531      3813.78      0.00159044      10060.1               0
>>>  300   300   300    0.0141153      3825.62      0.00421113      12823.2     1.07645e-06
>>>  400   400   400    0.0291599      4389.59      0.00858138        14916     4.00614e-06
>>>  500   500   500    0.0483492      5170.72       0.0166519      15013.3     8.96808e-06
>>>  600   600   600    0.0725783      5952.19       0.0279634      15448.7     1.63386e-05
>>>  700   700   700     0.113891      6023.29        0.043077        15925     2.69191e-05
>>>  800   800   800     0.171416      5973.79       0.0627796        16311     4.09782e-05
>>>  900   900   900     0.243677      5983.32       0.0922766      15800.3     5.88092e-05
>>> 1000  1000  1000     0.335158      5967.33        0.123339      16215.5     8.15988e-05
>>> 1100  1100  1100     0.515776      5161.15         0.16578      16057.5     0.000108991
>>> 1200  1200  1200     0.662706      5214.98        0.205989      16777.6     0.000141824
>>> 1300  1300  1300     0.845952      5194.15         0.27637        15899      0.00018111
>>> 1400  1400  1400      1.06712      5142.82        0.332118      16524.2     0.000225958
>>> 1500  1500  1500      1.38147      4886.11        0.409224      16494.6     0.000278265
>>> 1600  1600  1600      1.72238      4756.21        0.492314      16639.8     0.000338095
>>> 1700  1700  1700      2.38508      4119.77        0.603566      16279.9     0.000404362
>>> 1800  1800  1800      3.12034      3738.05        0.717409      16258.5     0.000481575
>>> 1900  1900  1900      3.93668      3484.66        0.824933      16629.2     0.000564727
>>> 2000  2000  2000      4.76038      3361.07        0.941643      16991.6      0.00065862
>>> 2100  2100  2100      5.90627      3135.99         1.12226      16504.2     0.000762307
>>> 2200  2200  2200      7.26419      2931.64         1.28213      16609.9     0.000876699
>>> 2300  2300  2300      8.88171      2739.79         1.45247      16753.5      0.00100222
>>> 2400  2400  2400      10.4956      2634.26         1.62705      16992.7      0.00113566
>>> 2500  2500  2500       11.913      2623.18         1.87499      16666.7      0.00128371
>>> 2600  2600  2600      13.7057      2564.77          2.1156      16615.6      0.00144259
>>> 2700  2700  2700      15.5959      2524.13         2.33957      16826.1      0.00161501
>>> 2800  2800  2800      17.1121      2565.67         2.57445      17053.8      0.00179901
>>> 2900  2900  2900      19.4167      2512.16         2.92445      16679.4      0.00199764
>>> 3000  3000  3000      21.3239      2532.37         3.18891      16933.7      0.00220999
>>> 3100  3100  3100      23.5049      2534.88          3.5305      16876.4      0.00243845
>>> 3200  3200  3200      25.7362      2546.45         3.81708      17169.1      0.00267581
>>> 3300  3300  3300      28.4467      2526.62         4.25869        16877      0.00293513
>>> 3400  3400  3400      30.4607      2580.63         4.67999      16796.6      0.00320688
>>> 3500  3500  3500      33.7737      2538.96         5.04289      17004.1      0.00349667
>>> 3600  3600  3600      36.9633      2524.45           5.414      17235.3      0.00380237
>>> 3700  3700  3700      39.5153      2563.71         6.04875      16748.2      0.00412583
>>> 3800  3800  3800      42.9412      2555.68         6.48985      16910.1      0.00446785
>>> 3900  3900  3900      46.5282      2549.81         7.05844        16808      0.00482701
>>> 4000  4000  4000      50.2218      2548.69         7.42442      17240.4      0.00520272
>>>
>>> As the generated machine code is completely different, I guess gcc notices the aligned alloc and uses the alignment information for optimisation.
>>>
>>> Cheers,
>>>
>>> Imre

_______________________________________________
ublas mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/ublas
Sent to: [hidden email]