
Matrix multiplication performance


Matrix multiplication performance

palik imre
Hi All,

What's next?  I mean, what is the development process for uBLAS?

Now we have a C-like implementation that outperforms both the mainline and the branch version (axpy_prod).  What will we do with that?

As far as I see we have the following options:

1) Create a C++ template magic implementation out of it.  But for this, at the very least we would need compile-time access to the target instruction set.  Any idea how to do that?  (See the sketch after this list.)

2) Create a compiled library implementation out of it, and choose the implementation run-time based on the CPU capabilities.

3) Include some good defaults/defines, and hope the user will use them.

4) Don't include it, and do something completely different.
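
A minimal sketch of what options 1 and 2 could look like, purely as an assumption for discussion: compile-time selection can key off the predefined compiler macros __AVX__ / __FMA__ (set by -mavx/-mfma or -march=native), and GCC/Clang's __builtin_cpu_supports() allows a run-time choice.  The ugemm_* kernel names below are hypothetical placeholders, not existing uBLAS functions.

    #include <cstdio>

    // Hypothetical micro-kernels; only the selection logic is illustrated.
    void ugemm_ref() { std::puts("reference micro-kernel"); }
    void ugemm_avx() { std::puts("AVX micro-kernel"); }
    void ugemm_fma() { std::puts("FMA micro-kernel"); }

    // Option 1: compile-time selection via predefined macros
    // (__AVX__/__FMA__ are defined by -mavx/-mfma or -march=native).
    void ugemm_static()
    {
    #if defined(__FMA__)
        ugemm_fma();
    #elif defined(__AVX__)
        ugemm_avx();
    #else
        ugemm_ref();
    #endif
    }

    // Option 2: run-time dispatch on CPU capabilities (GCC/Clang built-in).
    void ugemm_dynamic()
    {
        if (__builtin_cpu_supports("fma"))      ugemm_fma();
        else if (__builtin_cpu_supports("avx")) ugemm_avx();
        else                                    ugemm_ref();
    }

    int main()
    {
        ugemm_static();
        ugemm_dynamic();
    }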


What do you think?

Cheers,

Imre


Re: Matrix multiplication performance

palik imre
Hi Michael,

I had a look at your AVX micro-kernel on my old AMD box.  Congratulations: it is twice as fast as what I managed to get out of my optimised C kernel.

I wonder if I can catch up with that using newer gcc + SIMD arrays or intrinsics.
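
As an illustration of the "SIMD arrays" route, GCC's vector extensions let the compiler map a typedef'd vector type onto AVX registers.  A minimal sketch (purely illustrative, not an actual kernel; fma4 is a made-up name):

    // 4 doubles = 256 bit; needs -mavx (or -march=native) to map onto AVX registers.
    typedef double v4df __attribute__((vector_size(32)));

    // One 4-wide multiply-accumulate step of a micro-kernel update; with -mfma
    // the compiler can contract this into a single vfmadd instruction.
    static inline void fma4(v4df &c, v4df a, v4df b)
    {
        c += a * b;
    }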

Cheers,

Imre

--------------------------------------------
On Fri, 22/1/16, Michael Lehn <[hidden email]> wrote:

 Subject: Re: [ublas] Matrix multiplication performance
 To: "palik imre" <[hidden email]>
 Cc: "ublas mailing list" <[hidden email]>
 Date: Friday, 22 January, 2016, 21:27
 
Wow Imre!

Ok, that is actually a significant difference :-)

I have just added a new version to my site.  Unfortunately the computer I used for creating the page does not have posix_memalign().  So I had to hack my own function for that.  But I think for a proof of concept it will do.

But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA.  The micro-kernel for AVX requires MR=4, NR=8.  For FMA it requires MR=4, NR=16.  Otherwise the reference implementation gets selected.  Default values for all parameters can now be overwritten when compiling, e.g.

    g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc

The optimised micro-kernels are only included when compiled with -DHAVE_AVX or -DHAVE_FMA.

I put all this stuff here

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html

The tar-ball

    http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz

contains all required files:

    session2/avx.hpp
    session2/fma.hpp
    session2/gemm.hpp
    session2/matprod.cc

Of course I would be interested in how all this performs on other platforms.


Cheers,

Michael
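
A small self-contained illustration of the override mechanism described above (the default value below is made up; only the -D pattern matters):

    #include <cstdio>

    // Defaults that a -D flag on the command line can replace, e.g.
    //     g++ -O3 -Wall -std=c++11 -DHAVE_AVX -DBS_D_MC=512 matprod.cc
    #ifndef BS_D_MC
    #define BS_D_MC 256     // assumed default block size, for illustration only
    #endif

    int main()
    {
    #if defined(HAVE_FMA)
        std::puts("FMA micro-kernel selected (MR=4, NR=16)");
    #elif defined(HAVE_AVX)
        std::puts("AVX micro-kernel selected (MR=4, NR=8)");
    #else
        std::puts("reference micro-kernel selected");
    #endif
        std::printf("MC = %d\n", BS_D_MC);
    }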
 
 
On 22 Jan 2016, at 16:20, palik imre <[hidden email]> wrote:
 
> Hi All,
>
> In the meantime I enabled avx2 ...
>
> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6 GFLOPS
> So it is at ~25%
>
> Compiler is gcc 4.8.3
>
> vanilla run:
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
> $ ./matprod
> #     m     n     k    uBLAS: t1   MFLOPS   Blocked: t2   MFLOPS     Diff nrm1
>     100   100   100  0.000350402  5707.73    0.00116104  1722.59             0
>     200   200   200   0.00445094  3594.74    0.00819996  1951.23             0
>     300   300   300    0.0138515  3898.49     0.0266515  2026.15   1.06987e-06
>     400   400   400    0.0266447  4803.96     0.0613506  2086.37   4.01475e-06
>     500   500   500    0.0424372  5891.06      0.119345  2094.77   8.99605e-06
>     600   600   600    0.0724648  5961.51      0.203187  2126.12   1.63618e-05
>     700   700   700     0.115464  5941.25      0.325834  2105.36   2.69547e-05
>     800   800   800     0.173003  5918.98      0.480655  2130.43   4.09449e-05
>     900   900   900     0.248077   5877.2      0.689972  2113.13   5.87376e-05
>    1000  1000  1000      0.33781  5920.49      0.930591  2149.17   8.16264e-05
>    1100  1100  1100       0.5149  5169.93       1.25507     2121   0.000108883
>    1200  1200  1200     0.670628  5153.38       1.62732  2123.74   0.000141876
>    1300  1300  1300     0.852692  5153.09       2.06708  2125.71   0.000180926
>    1400  1400  1400      1.06695  5143.65       2.56183  2142.22   0.000225975
>    1500  1500  1500       1.3874   4865.2       3.16532  2132.49   0.000278553
>    1600  1600  1600      1.77623  4612.03        3.8137  2148.05   0.000338106
>    1700  1700  1700       2.3773  4133.26       4.56665  2151.69   0.000404458
>    1800  1800  1800      3.06381  3807.03       5.40317  2158.73    0.00048119
>    1900  1900  1900       3.9039  3513.92       6.37295  2152.53   0.000564692
>    2000  2000  2000      4.79166  3339.13       7.43399  2152.28   0.000659714
>    2100  2100  2100      6.04946  3061.76       8.62429  2147.65   0.000762223
>    2200  2200  2200      7.39085   2881.4       9.86237  2159.32   0.000875624
>    2300  2300  2300      9.00453  2702.42       11.2513  2162.78    0.00100184
>    2400  2400  2400      10.3952  2659.68       12.7491  2168.62    0.00113563
>    2500  2500  2500      12.2283  2555.55       14.4615  2160.92    0.00128336
>    2600  2600  2600      13.8912  2530.51       16.1965  2170.34    0.00144304
>    2700  2700  2700       15.391  2557.72       18.1998  2162.99    0.00161411
>    2800  2800  2800      17.5673  2499.19       20.2171  2171.63    0.00180035
>    2900  2900  2900      19.4621  2506.31       22.5482  2163.28    0.00199765
>    3000  3000  3000      21.4506  2517.42       24.9477  2164.53    0.00221028
>    3100  3100  3100        23.71  2512.95       27.5144  2165.48    0.00243877
>    3200  3200  3200      25.9051  2529.85       30.2816  2164.22    0.00267766
>    3300  3300  3300      28.1949  2549.18        33.176  2166.45    0.00293379
>    3400  3400  3400      30.7235  2558.56       36.0156  2182.61     0.0032087
>    3500  3500  3500      34.0419  2518.95       39.3929  2176.79    0.00349827
>    3600  3600  3600      37.0562  2518.12       42.7524  2182.62    0.00380447
>    3700  3700  3700      39.7885  2546.11       46.4748  2179.81    0.00412621
>    3800  3800  3800      43.6607  2513.56       50.2119  2185.62     0.0044694
>    3900  3900  3900      46.5104  2550.78       54.4822  2177.56    0.00482355
>    4000  4000  4000      50.6098  2529.15       58.7686  2178.03    0.00520289
>
> tuned run:
>
> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
> $ ./matprod2
> #     m     n     k    uBLAS: t1   MFLOPS   Blocked: t2   MFLOPS     Diff nrm1
>     100   100   100  0.000351671  5687.13   0.000316612  6316.88             0
>     200   200   200   0.00419531  3813.78    0.00159044  10060.1             0
>     300   300   300    0.0141153  3825.62    0.00421113  12823.2   1.07645e-06
>     400   400   400    0.0291599  4389.59    0.00858138    14916   4.00614e-06
>     500   500   500    0.0483492  5170.72     0.0166519  15013.3   8.96808e-06
>     600   600   600    0.0725783  5952.19     0.0279634  15448.7   1.63386e-05
>     700   700   700     0.113891  6023.29      0.043077    15925   2.69191e-05
>     800   800   800     0.171416  5973.79     0.0627796    16311   4.09782e-05
>     900   900   900     0.243677  5983.32     0.0922766  15800.3   5.88092e-05
>    1000  1000  1000     0.335158  5967.33      0.123339  16215.5   8.15988e-05
>    1100  1100  1100     0.515776  5161.15       0.16578  16057.5   0.000108991
>    1200  1200  1200     0.662706  5214.98      0.205989  16777.6   0.000141824
>    1300  1300  1300     0.845952  5194.15       0.27637    15899    0.00018111
>    1400  1400  1400      1.06712  5142.82      0.332118  16524.2   0.000225958
>    1500  1500  1500      1.38147  4886.11      0.409224  16494.6   0.000278265
>    1600  1600  1600      1.72238  4756.21      0.492314  16639.8   0.000338095
>    1700  1700  1700      2.38508  4119.77      0.603566  16279.9   0.000404362
>    1800  1800  1800      3.12034  3738.05      0.717409  16258.5   0.000481575
>    1900  1900  1900      3.93668  3484.66      0.824933  16629.2   0.000564727
>    2000  2000  2000      4.76038  3361.07      0.941643  16991.6    0.00065862
>    2100  2100  2100      5.90627  3135.99       1.12226  16504.2   0.000762307
>    2200  2200  2200      7.26419  2931.64       1.28213  16609.9   0.000876699
>    2300  2300  2300      8.88171  2739.79       1.45247  16753.5    0.00100222
>    2400  2400  2400      10.4956  2634.26       1.62705  16992.7    0.00113566
>    2500  2500  2500       11.913  2623.18       1.87499  16666.7    0.00128371
>    2600  2600  2600      13.7057  2564.77        2.1156  16615.6    0.00144259
>    2700  2700  2700      15.5959  2524.13       2.33957  16826.1    0.00161501
>    2800  2800  2800      17.1121  2565.67       2.57445  17053.8    0.00179901
>    2900  2900  2900      19.4167  2512.16       2.92445  16679.4    0.00199764
>    3000  3000  3000      21.3239  2532.37       3.18891  16933.7    0.00220999
>    3100  3100  3100      23.5049  2534.88        3.5305  16876.4    0.00243845
>    3200  3200  3200      25.7362  2546.45       3.81708  17169.1    0.00267581
>    3300  3300  3300      28.4467  2526.62       4.25869    16877    0.00293513
>    3400  3400  3400      30.4607  2580.63       4.67999  16796.6    0.00320688
>    3500  3500  3500      33.7737  2538.96       5.04289  17004.1    0.00349667
>    3600  3600  3600      36.9633  2524.45         5.414  17235.3    0.00380237
>    3700  3700  3700      39.5153  2563.71       6.04875  16748.2    0.00412583
>    3800  3800  3800      42.9412  2555.68       6.48985  16910.1    0.00446785
>    3900  3900  3900      46.5282  2549.81       7.05844    16808    0.00482701
>    4000  4000  4000      50.2218  2548.69       7.42442  17240.4    0.00520272
>
> As the generated machine code is completely different, I guess gcc notices the aligned alloc, and uses the alignment information for optimisation.
>
> Cheers,
>
> Imre
>
>
> On Friday, 22 January 2016, 15:09, Michael Lehn <[hidden email]> wrote:
>
> Hi Imre,
>
> thanks for running the benchmarks.  Of course you are right that using aligned memory for the buffers improves
> performance.  I also did not really put any effort into optimising the parameters MC, NC, KC, MR and NR.  I will
> compare different variants and report them on the website
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>
> I modified my benchmark program such that it also computes the FLOPS as
>
>     FLOPS = 2*m*n*k/time_elapsed
>
> See
>
>     http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>
> Could you re-run your benchmarks and post the different MFLOPS you get?  That is important for actually tuning things.
> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS).  So
> a speedup of 2.5 would be impressive but still far from peak performance.
>
> Cheers,
>
> Michael
>
>
> On 22 Jan 2016, at 11:03, palik imre <[hidden email]> wrote:
>
>> Sorry for posting twice more or less the same thing.  I got confused with javascript interfaces.
>>
>> It seems I also forgot to enable avx for my last measurements.  With that + my blocking and alignment changes, performance according to my tests is something like 250% higher than running Michael's original code (with avx).
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 10:33, palik imre <[hidden email]> wrote:
>>
>> Hi Michael,
>>
>> your blocksizes are far from optimal.  MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on Intel).  Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).
>>
>> This alone brought something like 50% speedup for my square matrix test.
>>
>> I will have a look at the other parameters + the whole thing via perf during the weekend.
>>
>> Cheers,
>>
>> Imre
>>
>>
>> On Friday, 22 January 2016, 0:28, "[hidden email]" <[hidden email]> wrote:
>>
>> Subject: Re: [ublas] Matrix multiplication performance
>>
>> Hi Nasos,
>>
>> first of all I don't want to take wrong credit and want to point out that this is not my algorithm.  It is based on
>>
>>     http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>
>>     https://github.com/flame/blis
>>
>> For a few cores (4-8) it can easily be made multithreaded.  For many-cores like Intel Xeon Phi this is a bit more
>> sophisticated but still not too hard.  The demo I posted does not use micro-kernels that exploit SSE, AVX or
>> FMA instructions.  With those the matrix product is on par with Intel MKL.  Just like BLIS.  For my platforms I wrote
>> my own micro-kernels but the interface of function ugemm is compatible with BLIS.
>>
>> Maybe you could help me to integrate your code into the benchmark example I posted above.
>>
>> About Blaze:  Do they have their own implementation of a matrix-matrix product?  It seems to require a
>> tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.
>> IMHO they only have tuned the "easy" stuff like BLAS Level 1 and Level 2.  In that case it makes more
>> sense to compare the performance with the actual underlying GEMM implementation.  But if I am wrong,
>> let me know.
>>
>> About the block sizes: In my experience you get better performance if you choose them dynamically at runtime
>> depending on the problem size.  Depending on the architecture you can just specify ranges like 256 - 384 for
>> blocking factor MC.  In my code it then also needs to satisfy the restriction that it can be divided by factor MR.
>> I know that doing things at compile time is a C++ fetish.  But the runtime overhead is negligible and having
>> blocks of similar sizes easily pays off.
>>
>> Cheers,
>>
>> Michael

Re: Matrix multiplication performance

Oswin Krause
In reply to this post by palik imre
Hi,

I would still vote for the route of rewriting uBLAS based on BLAS bindings
and providing a reasonable default implementation that also works well
without memory assumptions.  The main reason is that only having a fast
gemm implementation does not really improve things, given that BLAS
level 3 is quite a large beast.

I'm still willing to donate my partial uBLAS rewrite; unfortunately I am
a bit short on time to polish it (just finished my PhD and have a huge
load of work on my desk).  But if someone opened a git branch for that I
could try to make the code ready (porting my implementation back to
boost namespaces etc.).



Re: Matrix multiplication performance

Riccardo Rossi
In reply to this post by palik imre
Isn't the C++11 operator

alignof(...)

what you are looking for?
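
For reference, a small sketch of what the C++11 alignment facilities give you (alignment control rather than instruction-set detection): alignof queries the alignment of a type, alignas requests one, e.g. for cache-line-aligned buffers without posix_memalign().  The Block type below is just an illustrative name.

    #include <cassert>
    #include <cstdint>

    // A type forced to 64-byte (cache-line) alignment.
    struct alignas(64) Block { double d[8]; };
    static_assert(alignof(Block) == 64, "Block is cache-line aligned");

    int main()
    {
        alignas(64) double buffer[4 * 16];   // 64-byte aligned stack buffer
        assert(reinterpret_cast<std::uintptr_t>(buffer) % 64 == 0);
    }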

cheers
Riccardo

On Sun, Jan 24, 2016 at 8:50 AM, palik imre <[hidden email]> wrote:
Hi Michael,

I had a look on your AVX microkernel on my old AMD box.  Congratulations,  it is twice as fast as what I managed to get out of my optimised C kernel.

I wonder if I can catch up with that using newer gcc + SIMD arrays or intrinsics.

Cheers,

Imre




--

Riccardo Rossi

PhD, Civil Engineer


member of the Kratos Team: www.cimne.com/kratos

lecturer at Universitat Politècnica de Catalunya, BarcelonaTech (UPC)

Research fellow at International Center for Numerical Methods in Engineering (CIMNE)


C/ Gran Capità, s/n, Campus Nord UPC, Ed. C1, Despatx C9

08034 – Barcelona – Spain – www.cimne.com  - 

T.(+34) 93 401 56 96 skype: rougered4

 


Re: Matrix multiplication performance

Michael Lehn-2
In reply to this post by Oswin Krause

On 24 Jan 2016, at 10:18, Oswin Krause <[hidden email]> wrote:

Hi,

I would still vote for the route of rewriting uBLAS based on BLAS bindings and providing a reasonable default implementation that also works well without memory assumptions.  The main reason is that only having a fast gemm implementation does not really improve things, given that BLAS level 3 is quite a large beast.


Just a footnote to that point.  Once you have a fast GEMM, the rest of BLAS level 3 is not such a long road.  In particular,
the performance of SYMM, HEMM, TRMM, SYRK, … all just depends on the performance of the ugemm micro-kernel.

For example, in SYMM C = A*B you consider MCxMC blocks of the symmetric matrix A and MCxNC blocks of B.
Multiplication is done blockwise by first packing the blocks into buffers blockA and blockB.  If the elements of A are stored
in the upper triangular part you have three cases:

(1) The block is completely in the upper part and you pack it using the GEMM-pack.
(2) The block is completely in the lower part and you pack its transpose using the GEMM-pack.
(3) The block is on the diagonal, so you need an extra SYMM-pack routine that also packs the "virtual" elements (about 20 lines of code).

But after packing you can use the GEMM macro kernel (and thereby the GEMM micro kernel).
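
As a rough sketch of case (3), not the actual packing format (a real GEMM-pack arranges the data in MR-wide panels), the diagonal block can be materialised by mirroring the stored triangle; pack_symm_diag is a hypothetical helper assuming a general (incRow, incCol) storage scheme:

    #include <cstddef>

    // Pack an mc x mc diagonal block of a symmetric matrix whose elements are
    // stored in the upper triangle into a dense column-major buffer, filling in
    // the "virtual" lower part by symmetry.  (Simplified illustration.)
    template <typename T>
    void pack_symm_diag(std::size_t mc,
                        const T *A, std::ptrdiff_t incRowA, std::ptrdiff_t incColA,
                        T *buffer)
    {
        for (std::size_t j = 0; j < mc; ++j) {
            for (std::size_t i = 0; i < mc; ++i) {
                buffer[i + j*mc] = (i <= j) ? A[i*incRowA + j*incColA]   // stored
                                            : A[j*incRowA + i*incColA];  // mirrored
            }
        }
    }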




Re: Matrix multiplication performance

Michael Lehn-2



Some more notes on advantages of a C++ BLAS implementation:

(a) mixed precision arithmetic

For C = beta*C + alpha*A*B the matrices can have different element types as long as the element-wise
operation C(i,j) = beta*C(i,j) + alpha*A(i,l)*B(l,j) is supported.

Because the elements get copied into buffers anyway, you can upcast them to a common type.  The macro-kernel
then operates on homogeneous types.  So if the "common type of alpha, A, B" is double, it runs with the performance
of dgemm.  By the way, that already works in the example I posted.
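
A sketch of the idea (simplified: a real GEMM-pack arranges the data in MR-wide panels rather than one flat block; pack_block is a hypothetical name): the pack routine converts from the stored element type to the common computation type on the fly, so the macro-kernel only ever sees one type.

    #include <cstddef>

    // Copy an mc x kc block of A (element type TA, arbitrary strides) into a
    // contiguous buffer of the common type T, upcasting on the fly
    // (e.g. float -> double).
    template <typename T, typename TA>
    void pack_block(std::size_t mc, std::size_t kc,
                    const TA *A, std::ptrdiff_t incRowA, std::ptrdiff_t incColA,
                    T *buffer)
    {
        for (std::size_t l = 0; l < kc; ++l) {
            for (std::size_t i = 0; i < mc; ++i) {
                *buffer++ = static_cast<T>(A[i*incRowA + l*incColA]);
            }
        }
    }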

(b) Getting rid of col-major vs row-major

Packing is only an O(n^2) operation.  After packing, the elements are in a cache-friendly format in memory.  My
code supports

(i) col-major with incRow=1, incCol=numRows
(ii) row-major with incRow=numCols, incCol=1

But it is not required that either incRow or incCol is 1.  That is nice if your matrix actually references every second
row and every third column.  In Matlab notation that would be A(1:2:m, 1:3:n).  For the performance that does not
matter …
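
To make the (incRow, incCol) convention concrete, here is a minimal hypothetical view type (MatrixView is not part of the posted code) and how the layouts mentioned above map onto it:

    #include <cstddef>

    // Element (i,j) lives at data[i*incRow + j*incCol].
    template <typename T>
    struct MatrixView {
        T              *data;
        std::size_t     numRows, numCols;
        std::ptrdiff_t  incRow, incCol;

        T &operator()(std::size_t i, std::size_t j) const
        {
            return data[i*incRow + j*incCol];
        }
    };

    // For an m x n matrix stored col-major at p:
    //   col-major view:  {p, m, n, 1, (std::ptrdiff_t)m}
    //   row-major data:  {p, m, n, (std::ptrdiff_t)n, 1}
    //   A(1:2:m, 1:3:n): {p, (m+1)/2, (n+2)/3, 2, (std::ptrdiff_t)(3*m)}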

(c) Matrix-expressions

Even if a matrix is actually an expression that can be evaluated component-wise, like A = A1 + A2, the product
C = (A1+A2)*(B1+B2) can be computed with the same performance as with evaluated matrices C = A*B, as
long as each element of the expression can be evaluated in O(1).
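
For instance, a sum expression with O(1) element access (MatSum is a hypothetical type, not from the posted code) can be fed to the pack routines exactly like a plain matrix, so no temporaries are needed:

    #include <cstddef>

    // A1 + A2 as an expression with element type T: element access is O(1),
    // so packing (A1+A2) costs the same as packing an evaluated matrix.
    template <typename T, typename MA, typename MB>
    struct MatSum {
        const MA &A;
        const MB &B;

        T operator()(std::size_t i, std::size_t j) const
        {
            return A(i, j) + B(i, j);
        }
    };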

Features like (a) and (b) are supported in BLIS.  But as they are using C99, the implementation using macros
instead of simple template functions is not so easy to understand (IMHO).  By default they have disabled these
features due to long compile times.  In C++ this comes for free.  Or more precisely, it comes for free for us as
the ugly part is done by the compiler …

Anyway, I think what C++ would need is its own "BLAS C++ Standard Interface” for low-level functionality.  This
can be supported by simple reference implementations or sophisticated high-performance implementations.  But
we would have a common ground on which one can build high-level C++ libraries.








Re: Matrix multiplication performance

Karl Meerbergen-2

Anyway, I think what C++ would need is its own "BLAS C++ Standard Interface” for low-level functionality.  This
can be supported by simple reference implementations or sophisticated high-performance implementations.  But
we would have a common ground on which one can build high-level C++ libraries.

I agree, but there is no std type/concept for a matrix. At this stage, it is impossible for developers of linear algebra to write libraries in C++ since their users have to use the underlying library for their applications. With the bindings, we have tried to suggest such a common ground. I once (many years ago) started to build a library from the bindings, but gave up because of time constraints …

Best,

Karl



Re: Matrix multiplication performance

Riccardo Rossi
In reply to this post by Oswin Krause

I would like to say that, as a user, a native implementation is way better than bindings.

BLAS is always a nasty dependency to have in a project... and header-only libs are the ideal situation.

I think, for example, this justifies the success of Eigen over the alternatives.

Cheers
Riccardo


Re: Matrix multiplication performance

Karl Meerbergen-2
For academic research this is not an issue, but it is an issue where a little performance loss plays a role. I know one person who complained about Eigen for that reason: it was hard to get back the performance of old code that used tuned BLAS libraries.

Living with headers only is a tough limitation. The idea that someone will rewrite all available software as headers only does not make sense. It would imply rewriting many good libraries such as LAPACK, SLATEC, MUMPS, UMFPACK, QUADPACK, PARDISO, ARPACK, etc. If you want to solve a dense linear system, just use LAPACK. Most machines have compiled LAPACK and BLAS anyway.

We need both native implementations and bindings. Native implementations are nice to try out a concept, bindings are needed for specific functionalities, sometimes for performance. And, with a good build system, once the software packages are set up, linking is pretty straightforward.

Best,

Karl



On 24 Jan 2016, at 14:59, Riccardo Rossi <[hidden email]> wrote:

I would like to say that as a user, native implementation is way better than bindings.

Blas is always a nasty dependence to have in a project...and headeronly libs are the ideal situation.

I think for example this justifies the success of eigen over alternatives.

Cheers
Riccardo


Re: Matrix multiplication performance

Michael Lehn-2
In reply to this post by Karl Meerbergen-2

On 24 Jan 2016, at 13:11, Karl Meerbergen <[hidden email]> wrote:


Anyway, I think what C++ would need is its own "BLAS C++ Standard Interface” for low-level functionality.  This
can be supported by simple reference implementations or sophisticated high-performance implementations.  But
we would have a common ground on which one can build high-level C++ libraries.

I agree, but there is no std type/concept for a matrix. At this stage, it is impossible for developers of linear algebra to write libraries in C++ since their users have to use the underlying library for their applications. With the bindings, we have tried to suggest such a common ground. I once (many years ago) started to build a library from the bindings, but gave up because of time constraints …


I completely understand the problem with the time constraints.  But I hope that once there is a common ground
we can take advantage of the fact that different people have different constraints.  For myself the primary
constraint for producing code is the possibility to use it for teaching.  Otherwise I sooner or later don't really have
time to maintain it.  Like Imre pointed out, my code for the matrix product is "C++ code in C style".  That has the
following background.  One of my lectures is an undergraduate class, "Introduction to HPC".  Students attending this
class have different majors (e.g. math, physics, engineering) and therefore different backgrounds.  Some take the
class before they have had numerical linear algebra; others have never coded in C/C++.  So the class really has to start at
zero: basics of hardware architectures and introduction to assembly (Intel 64), introduction to C … The goal of the
class is a blocked implementation of the LU factorization (same algorithm as DGETRF).  That requires
some BLAS Level 1 and 2 functions and optimized DGEMM and DTRSM functions.  GEMM micro-kernels are optimized
for the machines the students actually use.  This may not sound like much, but the students have to code each line
themselves, and that is only possible if things are restricted to a "primitive" level.  After all it is not a general class on
computer science but on HPC.  So having, at the end of the semester, an LU factorization that can compete with MKL
DGETRF is (in this case) more important than a general implementation.

So, long story short: my job allows me to provide examples for special cases.  In particular I can provide
self-contained proof-of-concept examples.

This semester we also offer a master-level class on HPC.  Here we also give an introduction to C++.  But again I don't
want students to use a ready-made library.  All the stuff like the matrix class has to be coded by the students themselves.  IMHO
if students have coded their simple-but-own matrix library then they are in the long run better prepared to use standard
libraries.  The same goes for OpenMP: before they "just use" OpenMP, one project is writing their own thread pool so that they
see how OpenMP works internally.

So, long story short: even in the future we will just use "vim + gcc + g++".  But I hope that students later are able to
use and contribute to "mainstream C++ libraries" in this domain.

************************************************************************************************************************************
One more note on proof-of-concept examples:  For BLAS I have a *complete* header-only C++ implementation.  For
near-peak performance it "only" requires an optimized ugemm for float, double, complex<float> and complex<double>.
************************************************************************************************************************************

But anyway, the first step would be to specify a common *low-level* C++ standard for BLAS.  In my opinion that
should be independent of any matrix classes.  For my C++ BLAS implementation the GEMM signature is
this:
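
(The signature itself did not survive in this archive; the following is a hypothetical reconstruction in the style described, with raw pointers, row/column increments and independently templated element types, not necessarily the original declaration.)

    template <typename Index,
              typename Alpha, typename TA, typename TB,
              typename Beta,  typename TC>
    void
    gemm(Index m, Index n, Index k,
         const Alpha &alpha,
         const TA *A, Index incRowA, Index incColA,
         const TB *B, Index incRowB, Index incColB,
         const Beta &beta,
         TC *C, Index incRowC, Index incColC);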


I know this really looks primitive.  But you have experience with bindings to BLAS: at some point you have to get down to
details like pointers and matrix formats.  I also don't use enums for "Trans", "NoTrans", "Lower", …  Every C++
matrix library has its own types for that, so this saves writing traits for converting them.  In order to be successful
this has to be attractive for other widely used C++ libraries like MTL4.

And of course, for any C++ BLAS implementation we can offer a C and Fortran interface, rather than the other way around ...







Re: Matrix multiplication performance

Riccardo Rossi
In reply to this post by Karl Meerbergen-2

Dear Karl,
   Let me disagree just about your last point:

Linking BLAS and LAPACK IS a big pain for any moderately large project.

Just check the mailing lists of, for example, the Trilinos project and you'll readily verify my statement.

In particular I consider the use of static libs and of Fortran for the BLAS libs particularly nasty, since errors show up at link time or run time instead of at compile time.
It gave us very nasty problems in our Windows ports.

Cheers
Riccardo

On 24 Jan 2016 15:34, "Karl Meerbergen" <[hidden email]> wrote:
For academic research this is not an issue, but it is an issue where a little performance loss plays a role. I know one person who complained about Eigen for that reason: it was hard to get performance back of old code that used tuned BLAS libraries.

Living with headers only is a tough limitation. The idea that someone wil rewrite all available software to headers only does not make sense. It would imply rewriting many good libraries such as LAPACK, SLATEC, MUMPS, UMFPACK, QUADPACK, PARDISO, ARPACK, etc. If you want to solve a dense linear system, just use LAPACK. Most machines have compiled lapack and blas anyway.

We need both native implementations and bindings. Native implementations are nice to try out a concept, bindings are needed for specific functionalities, sometimes for performance. And, with a good build system, once the software packages are set up, linking is pretty straightforward.

Best,

Karl



On 24 Jan 2016, at 14:59, Riccardo Rossi <[hidden email]> wrote:

I would like to say that as a user, native implementation is way better than bindings.

Blas is always a nasty dependence to have in a project...and headeronly libs are the ideal situation.

I think for example this justifies the success of eigen over alternatives.

Cheers
Riccardo

On 24 Jan 2016 10:18, "Oswin Krause" <[hidden email]> wrote:
Hi,

I would still vote for the route to rewrite uBLAS based on BLAS bindings and providing a reasonable default implementation that also works well without memory assumptions.The main reason is that only having a fast gemm implementation does not really improve things, given that BLAS level 3 is a quite large beast.

Im still willing to donate my partial uBLAS rewrite, unfortunately I am a bit short on time to polish it(just finished my phd and have a huge load of work on my desk). But if someone opened a git-branch for that i could try to make the code ready (porting my implementation back to boost namespaces etc).




Re: Matrix multiplication performance

Oswin Krause
Hi,

the obvious solution to this is to link to the C bindings (cBLAS,
LAPACKE). While BLAS is truly a PITA, most systems have quite sane cBLAS
bindings (a notable exception is OpenBLAS...) and often it is enough to
check whether the system has some libcblas.so somewhere in the path.
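To make this concrete, a binding for the all-dense, all-double case can be very small. A rough sketch only, assuming <cblas.h> is available and that the uBLAS matrices use the default row-major, contiguous storage; the wrapper name is made up:

#include <cblas.h>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

// Hypothetical wrapper: forward C = A*B to cblas_dgemm for dense doubles.
void gemm_via_cblas(const ublas::matrix<double> &A,
                    const ublas::matrix<double> &B,
                    ublas::matrix<double>       &C)
{
    const int m = static_cast<int>(A.size1());
    const int k = static_cast<int>(A.size2());
    const int n = static_cast<int>(B.size2());

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, &A(0,0), k,     // lda = number of columns of A
                     &B(0,0), n,     // ldb = number of columns of B
                0.0, &C(0,0), n);    // ldc = number of columns of C
}

Everything beyond this (detecting libcblas.so at configure or run time, other element types) is then mostly a packaging question.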

the big issue with a header-only BLAS is the long, long, long
compile times.


Re: Matrix multiplication performance

Michael Lehn-2
Two solutions to this:

- precompiled headers
- from a header-only C++ BLAS library it is trivial to create a compiled library for the common cases where all elements have the same type (see the sketch below).
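A minimal sketch of the second point, with invented names (this is not any existing uBLAS or ulmBLAS code): the generic kernel lives in a header as a template, and a tiny .cpp file explicitly instantiates it for the common element type, which is then the "compiled library".

// --- mygemm.hpp: generic, header-only implementation --------------------
#include <cstddef>

template <typename T>
void my_gemm(std::size_t m, std::size_t n, std::size_t k,
             T alpha, const T *A, std::size_t ldA,
                      const T *B, std::size_t ldB,
             T beta,        T *C, std::size_t ldC)
{
    // naive reference loops; a real header would contain the blocked code
    for (std::size_t i=0; i<m; ++i) {
        for (std::size_t j=0; j<n; ++j) {
            T acc = T(0);
            for (std::size_t l=0; l<k; ++l) {
                acc += A[i*ldA+l] * B[l*ldB+j];
            }
            C[i*ldC+j] = beta*C[i*ldC+j] + alpha*acc;
        }
    }
}

// --- mygemm_double.cpp: the "compiled library" for doubles --------------
template void my_gemm<double>(std::size_t, std::size_t, std::size_t,
                              double, const double*, std::size_t,
                                      const double*, std::size_t,
                              double, double*, std::size_t);

Client code that only multiplies doubles can then declare the template with "extern template" (C++11) and link against the object file, so it never pays the instantiation and optimization cost itself.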






Re: Matrix multiplication performance

Michael Lehn-2

Also the compile time depends on the optimization level.  Maybe somebody else could try this: when I compile my demos with “-O1” they achieve
the same performance.  So this would be something new in the C++ world: reduce the optimization level :-)

I never tried it, but it seems that at least with gcc you can do this on a per-file level.


It also seems that only the two pack functions need some optimization.  So the rest can even go lower.
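For what it's worth, a small gcc-specific sketch of that idea (the function name is invented): compile the whole file with -O1 and request full optimization only for the hot pack routines.

#include <cstddef>

// gcc extension: raise the optimization level for this one function,
// while the rest of the translation unit stays at whatever -O level the
// file was compiled with.
__attribute__((optimize("O3")))
void pack_block(const double *A, std::size_t n, double *buffer)
{
    for (std::size_t i = 0; i < n; ++i) {
        buffer[i] = A[i];
    }
}

Per-file is even simpler: put the pack functions and the micro-kernel into their own translation unit and give just that file -O3 in the build.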


Re: Matrix multiplication performance

Peter Schmitteckert-2
In reply to this post by Karl Meerbergen-2
Dear All,


first of all I'm impressed by the recent results.

 > We need both native implementations and bindings.

I completely agree.

 > Native implementations are nice to try out a concept, bindings are needed for specific functionalities, sometimes for performance.

I have been using uBLAS for about 15 years in an HPC project that sometimes runs on over 1000 cores 24/7 for months.
In this kind of application it is critical to get maximum performance. Anything else just hurts.
For that I have to interface to MKL/ACML.

However, there are also applications where I'm concerned with accuracy. I had problems where I had to use
multiprecision arithmetic for matrices. In Eigen3 that worked amazingly well. Sadly, I can't use that for my
main code, as I'm using my BLAS interface for uBLAS directly in many places. (I wrote it before the bindings library existed.)

Then I have applications where we just want to play with funny ideas and need flexibility in data types beyond BLAS.

 > packages are set up, linking is pretty straightforward.

Sometimes there are projects where you spend more time on resolving linking issues than on doing numerics.
I'd also like to remark that with today’s machines we often only need a fast implementation of the BLAS operations,
not an optimal one.

The problem with a generic C++ matrix interface is that we have been waiting for it for decades now, while many
interesting ideas/projects/implementations have come and gone with the tides.



Best regards.
Peter


P.S.
On 24 Jan 2016, at 16:32, Michael Lehn <[hidden email]> wrote:
 > So this would be something new in the C++ world: reduce the optimization level :-)
Well, not actually new, sometimes that happens.




Re: Matrix multiplication performance

Michael Lehn-2

On 24 Jan 2016, at 17:02, Peter Schmitteckert <[hidden email]> wrote:

> Dear All,
>
>
> first of all I'm impressed by the recent results.
>
> > We need both native implementations and bindings.
>
> I completely agree.

I also agree.  But that does not contradict specifying a unified low-level C++ BLAS standard.  One
possible and obvious implementation would be a binding to BLAS or CBLAS.  Also, there is no real
BLAS or CBLAS standard in the “ANSI” or “ISO” sense.  It is just a de facto standard.  So as long as it is
useful it will become a de facto standard.  Even if it just turns out to be the nicest way to wrap a BLAS binding, it
should be attractive for other C++ libraries like MTL4 or BLAZE.

It is also amazing that libraries like ATLAS, OpenBLAS or BLIS are written in C and not in C++.  I am
pretty familiar with their code, and using C++ would make their implementation so much easier.  And
I am not talking about OOP, just using template functions and doing the rest in C style.  Basically
they use macros to generate functions like sgemm, dgemm, cgemm, zgemm from a single
implementation.  Very unreadable, believe me.  So even for a BLAS implementation that just has
to support a C and Fortran interface, my choice would be C++ for the actual implementation.  And
I am certainly not a C++ fanboy.  Really.
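Just to make that comparison concrete, here is a toy sketch of the style (invented names, nothing like the actual BLIS/OpenBLAS sources): one template instead of the macro machinery, plus a thin extern "C" layer for the traditional symbols.

#include <cstddef>

// One generic implementation for every element type ...
template <typename T>
T dot_impl(std::size_t n, const T *x, std::size_t incX,
                          const T *y, std::size_t incY)
{
    T result = T(0);
    for (std::size_t i = 0; i < n; ++i) {
        result += x[i*incX] * y[i*incY];
    }
    return result;
}

// ... and thin extern "C" wrappers where the C implementations use macro
// expansion to stamp out the s/d/c/z variants separately.
extern "C" {

float my_sdot(int n, const float *x, int incX, const float *y, int incY)
{
    return dot_impl<float>(n, x, incX, y, incY);
}

double my_ddot(int n, const double *x, int incX, const double *y, int incY)
{
    return dot_impl<double>(n, x, incX, y, incY);
}

} // extern "C"

The same pattern scales to gemm: sgemm, dgemm, cgemm and zgemm become four one-line wrappers around a single template.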

Re: Matrix multiplication performance

palik imre
In reply to this post by palik imre
AFAIK there is a boost::simd project.  If we really want SIMD classes, we might try to help them get into the mainline.

Right now I am trying to catch up with Michael using gcc SIMD vectors (his code is still 10% faster ...).  This should work on gcc, icc, and clang.  I think that is general enough for most people's needs.

Anyway, we need a fallback path for non-builtin types, and that could also be used with compilers that do not support gcc vectors.
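For the fallback, something along these lines might do. This is a sketch only, with made-up names; it assumes the usual packed-panel layout (MR contiguous entries of A and NR contiguous entries of B per k-step), and the block sizes are passed as template parameters so the code is self-contained:

#include <cstddef>

// Generic scalar micro-kernel: works for any element type T (including
// non-builtin ones) and for compilers without the gcc vector extension.
// Example call: ugemm_generic<4, 8>(kc, alpha, A, B, beta, C, incRowC, incColC);
template <int MR, int NR, typename Index, typename T>
void ugemm_generic(Index kc, T alpha, const T *A, const T *B,
                   T beta, T *C, Index incRowC, Index incColC)
{
    T P[MR*NR] = {};                       // local accumulator for the MRxNR block
    for (Index l = 0; l < kc; ++l) {       // rank-1 update per k-step
        for (Index i = 0; i < MR; ++i) {
            for (Index j = 0; j < NR; ++j) {
                P[i*NR + j] += A[l*MR + i] * B[l*NR + j];
            }
        }
    }
    for (Index j = 0; j < NR; ++j) {       // write back: C <- beta*C + alpha*P
        for (Index i = 0; i < MR; ++i) {
            C[i*incRowC + j*incColC] = beta*C[i*incRowC + j*incColC]
                                     + alpha*P[i*NR + j];
        }
    }
}

Dispatching between this and a gcc-vector kernel for double could then be a compile-time choice, e.g. an overload or specialization guarded by #ifdef __AVX__.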

Cheers,

Imre


On Sunday, 24 January 2016, 1:36, Joaquim Duran Comas <[hidden email]> wrote:


Hello,

It has been a great job.

The AVX micro-kernel has been implemented in assembler. Note that msvc, clang and g++ expose SSE*, AVX, NEON and other SIMD instructions to the C language, so it should be possible to rewrite the asm code in C.

Also, basic SIMD classes could be created (SIMD<char>, SIMD<float>, ...) that call the proper functions to implement the operations.

Joaquim Duran





Re: Matrix multiplication performance

Michael Lehn-2
In reply to this post by palik imre


At the moment I am not sure, but pretty sure, that you don’t have to rewrite uBLAS to get good performance.  uBLAS is a pretty fine
library and already has what you need.  So just extend it; at least it only took 383 lines of code :-)

I was pretty busy the last few days, so I could not continue until today.  I made some minimal modifications to adapt the GEMM implementation
to take advantage of uBLAS:

- The GEMM frame algorithm and the pack functions now work with any uBLAS matrix that supports element access through the notation A(i,j) (see the sketch after this list)
- Only the matrix C in the operation C <- beta*C + alpha*A*B is required to be stored row-major or col-major.  The other matrices can
be expressions.  Here I need some help to get rid of these ugly lines of code:

    TC *C_ = &C(0,0);                                // pointer to the first element of C
    const size_type incRowC = &C(1,0) - &C(0,0);     // stride between two rows
    const size_type incColC = &C(0,1) - &C(0,0);     // stride between two columns

- Things like the following should work without a performance penalty (and in particular without creating temporaries):
  blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
 
- And matrices A and B can be symmetric or whatever.  Of course, if A or B is triangular a specialized implementation can take advantage of that.
- The storage format does not matter.  You can mix row-major and col-major, packed, …
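To illustrate the first point: a pack routine really only needs operator()(i,j).  Roughly like this, as a sketch with invented names (not the code from the page), zero-padding the incomplete rim panel:

// Pack an mc x kc block of A into panels of MR rows, as the micro-kernel
// expects: within each panel, the MR entries of one column are contiguous.
template <typename MatrixA, typename Index, typename T>
void
pack_A_panels(const MatrixA &A, Index mc, Index kc, Index MR, T *buffer)
{
    for (Index i0=0; i0<mc; i0+=MR) {            // one panel of MR rows
        for (Index l=0; l<kc; ++l) {             // column index inside the block
            for (Index i=0; i<MR; ++i) {         // row inside the panel
                *buffer++ = (i0+i<mc) ? A(i0+i, l) : T(0);   // zero-pad the rim
            }
        }
    }
}

Since only element access is used, A can just as well be an expression like 3*A1+A2.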

Here is the page for this:


Please note, creating the benchmarks for the symmetric matrix-matrix product really takes a long time because the current uBLAS implementation seems
to be much slower than for general matrices.  So I reduced the problem size for now.  It would be helpful if some people could reproduce the benchmarks
and also check different cases:

- different expressions
- different element types, e.g. A with floats, B with doubles etc.  At the moment there is only a micro kernel for double.  The performance depends on the
common type of A and B.  So with complex<..> the reference implementation gets selected.  But the performance should at least be constant in this case.

It would in particular be nice to have benchmarks from people with an Intel Sandy Bridge or Haswell, as the micro kernels are optimized for these
architectures.  If there is interest I can extend the benchmarks to compare with Intel MKL.  For Linux there is a free non-commercial version available.


Cheers,

Michael







Re: Matrix multiplication performance

palik imre
In reply to this post by palik imre
The following is the best gcc SIMD-array-based micro-kernel I was able to come up with:

template <typename Index>
void
ugemm(Index kc, double alpha,
      const double *A, const double *B,
      double beta,
      double *C, Index incRowC, Index incColC)
{
    const Index MR = BlockSize<double>::MR;
    const Index NR = BlockSize<double>::NR;

    // vector of 4 doubles, i.e. one 256-bit AVX register
    typedef double v4df __attribute__((vector_size (32)));

    // accumulator for the MRxNR block, viewed as vectors of 4 doubles
    v4df P[BlockSize<double>::MR*BlockSize<double>::NR/4 + 1] __attribute__ ((aligned (128)));
    // the packed B panel reinterpreted as vectors (assumes proper alignment)
    const v4df *B_ = (v4df *)B;
    const v4df nv = {0.,0.,0.,0.};
    for (Index l=0; l<MR*NR/4; ++l) {
      P[l] = nv;
    }
    // rank-1 updates: element (i,l) of the row-major packed A panel times
    // the l-th row of the packed B panel, NR/4 vectors at a time
    for (Index i=0; i<MR; ++i) {
      for (Index l=0; l<kc; ++l) {
        for (Index j=0; j<(NR/4); ++j) {
          P[i * NR/4 + j] += A[l + i*kc]*B_[l*(NR/4)+j];
        }
      }
    }
    // write back: C <- beta*C + alpha*P
    double *P_ = (double *)P;
    for (Index j=0; j<NR; ++j) {
        for (Index i=0; i<MR; ++i) {
            C[i*incRowC+j*incColC] *= beta;
            C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
        }
    }
}

Notes about it:

- It is row-major, as I find it easier to think that way.  So it needs a different packing routine.

- It won't compile on gcc 4.6, as that compiler is unwilling to emit vbroadcastsd.

- It assumes that the A & B arrays are properly aligned (gcc won't emit unaligned vector stores for SIMD arrays).

- It is really sensitive to block size.  On my old AMD box it comes within 90% of Michael's AVX kernel with KC=64, MR=8, & NR=16, while on my AVX2 box it gets within 70% of Michael's FMA kernel with KC=64, MR=8, & NR=32.  Part of the reason for the difference is that I cannot persuade gcc to accumulate into registers.
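Regarding the block-size sensitivity, one way to expose the parameters (in the spirit of the -DBS_D_MC=... overrides used earlier in the thread) is a small policy class with compile-time defaults.  This is a sketch, not the actual BlockSize<> from the session code, and the default values are only illustrative:

#include <cstddef>

// Defaults can be overridden on the command line, e.g.
//   g++ -O3 -std=c++11 -DBS_D_MR=8 -DBS_D_NR=16 -DBS_D_KC=64 ...
#ifndef BS_D_MR
#define BS_D_MR 4
#endif
#ifndef BS_D_NR
#define BS_D_NR 8
#endif
#ifndef BS_D_KC
#define BS_D_KC 256
#endif

template <typename T>
struct BlockSize;

template <>
struct BlockSize<double>
{
    static const std::size_t MR = BS_D_MR;
    static const std::size_t NR = BS_D_NR;
    static const std::size_t KC = BS_D_KC;
};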

Cheers,

Imre








Re: Matrix multiplication performance

Nasos Iliopoulos
In reply to this post by Michael Lehn-2
We would definitely like to adopt an implementation like this.  We need, though, to make sure:

1. It is called only if both arguments are dense matrices, or one dense matrix and one dense vector.  We should disable any implicit conversions to avoid surprises like converting a sparse matrix to a dense one (a sketch of such a guard follows after this list).

2. We should provide extensive benchmarks for a large range of matrix sizes.

3. We should set up benchmarks that find optimal sizes for the micro and meso kernels.  These benchmarks may be slow, and we may want to find a way to publish optimal sizes for various architectures.  Maybe just include them in a readme file.

4. We should provide an interface to compile with these optimal sizes, and provide a fallback when they are not given.

5. uBLAS is one of the fastest libraries for small matrices.  Because I think the most prevalent use of dense multiplication is on small matrices (geometric transformations etc.), we need to make sure this stays that way.
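Concerning 1, the guard could be as simple as a trait plus enable_if.  A sketch with a made-up trait name; a real dispatch would of course have to cover the other dense uBLAS containers too:

#include <boost/numeric/ublas/matrix.hpp>
#include <type_traits>

namespace ublas = boost::numeric::ublas;

// Made-up trait: true only for the dense matrix template.
template <typename M>
struct is_dense_matrix : std::false_type {};

template <typename T, typename F, typename A>
struct is_dense_matrix< ublas::matrix<T, F, A> > : std::true_type {};

// The fast blocked product only participates in overload resolution when
// all three operands are dense; sparse arguments keep using the existing
// generic code path, so nothing gets converted implicitly.
template <typename M1, typename M2, typename M3>
typename std::enable_if< is_dense_matrix<M1>::value
                      && is_dense_matrix<M2>::value
                      && is_dense_matrix<M3>::value >::type
fast_prod(const M1 &A, const M2 &B, M3 &C)
{
    // ... call the blocked GEMM here ...
    (void)A; (void)B; (void)C;
}

In the real integration this would just gate which product overload the dispatch picks.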

Concerning 3, please note that in my experience there is not a single combination that is optimal for all matrix sizes, and this is reflected in the previous developments in the fast matrix multiplication branch on GitHub. I hope we can find a good compromise.

I will create a branch with the aim of creating an implementation that embodies 1-4 and uses Michael's work. I apologize for not having a lot of time to invest in this endeavor at the moment, but I think if I make an outline of the integration requirements we can work it out through pull requests. When things are set up I will run benchmarks on various architectures and compilers (currently I have access to clang, gcc, intel and msvc).

-Nasos


