[PATCH 0/3] boost::ublas: improving the performance of dense matrix multiplication

palik imre
This series pulls Michael Lehn's gemm implementation into ublas.

Performance on Haswell, measured with the bench1 test in ublas:

before:

DOUBLE, 3
bench_3
prod (matrix, matrix)
C array
elapsed: 0.4 s, 321.865 Mflops
c_matrix safe
elapsed: 3.35 s, 38.4317 Mflops
c_matrix fast
elapsed: 2.84 s, 45.3331 Mflops
matrix<unbounded_array> safe
elapsed: 6.28 s, 20.501 Mflops
matrix<unbounded_array> fast
elapsed: 5.87 s, 21.9329 Mflops
DOUBLE, 10
bench_3
prod (matrix, matrix)
C array
elapsed: 0.44 s, 411.814 Mflops
c_matrix safe
elapsed: 2.49 s, 72.7703 Mflops
c_matrix fast
elapsed: 2.3 s, 78.7818 Mflops
matrix<unbounded_array> safe
elapsed: 5.43 s, 33.3698 Mflops
matrix<unbounded_array> fast
elapsed: 5.42 s, 33.4314 Mflops
DOUBLE, 30
bench_3
prod (matrix, matrix)
C array
elapsed: 3.41 s, 445.514 Mflops
c_matrix safe
elapsed: 16.72 s, 90.8614 Mflops
c_matrix fast
elapsed: 16.23 s, 93.6046 Mflops
matrix<unbounded_array> safe
elapsed: 40.55 s, 37.4649 Mflops
matrix<unbounded_array> fast
elapsed: 40.52 s, 37.4927 Mflops
DOUBLE, 100
bench_3
prod (matrix, matrix)
C array
elapsed: 5.07 s, 374.322 Mflops
c_matrix safe
elapsed: 19.2 s, 98.8444 Mflops
c_matrix fast
elapsed: 19.06 s, 99.5704 Mflops
matrix<unbounded_array> safe
elapsed: 48.54 s, 39.0979 Mflops
matrix<unbounded_array> fast
elapsed: 48.54 s, 39.0979 Mflops
DOUBLE, 300
bench_3
prod (matrix, matrix)
C array
elapsed: 3.23 s, 477.516 Mflops
c_matrix safe
elapsed: 15.92 s, 96.883 Mflops
c_matrix fast
elapsed: 15.91 s, 96.9439 Mflops
matrix<unbounded_array> safe
elapsed: 39.6 s, 38.9489 Mflops
matrix<unbounded_array> fast
elapsed: 39.57 s, 38.9785 Mflops
DOUBLE, 1000
bench_3
prod (matrix, matrix)
C array
elapsed: 4.85 s, 393.071 Mflops
c_matrix safe
elapsed: 19.9 s, 95.7987 Mflops
c_matrix fast
elapsed: 19.8 s, 96.2826 Mflops
matrix<unbounded_array> safe
elapsed: 49.16 s, 38.7794 Mflops
matrix<unbounded_array> fast
elapsed: 49.27 s, 38.6928 Mflops


after:

DOUBLE, 3
bench_3
prod (matrix, matrix)
C array
elapsed: 0.37 s, 347.962 Mflops
c_matrix safe
elapsed: 4.52 s, 28.4836 Mflops
c_matrix fast
elapsed: 4.47 s, 28.8022 Mflops
matrix<unbounded_array> safe
elapsed: 6.37 s, 20.2113 Mflops
matrix<unbounded_array> fast
elapsed: 7.75 s, 16.6124 Mflops
DOUBLE, 10
bench_3
prod (matrix, matrix)
C array
elapsed: 0.44 s, 411.814 Mflops
c_matrix safe
elapsed: 2.79 s, 64.9456 Mflops
c_matrix fast
elapsed: 2.79 s, 64.9456 Mflops
matrix<unbounded_array> safe
elapsed: 5.44 s, 33.3085 Mflops
matrix<unbounded_array> fast
elapsed: 5.89 s, 30.7637 Mflops
DOUBLE, 30
bench_3
prod (matrix, matrix)
C array
elapsed: 3.44 s, 441.629 Mflops
c_matrix safe
elapsed: 4.01 s, 378.854 Mflops
c_matrix fast
elapsed: 4.02 s, 377.911 Mflops
matrix<unbounded_array> safe
elapsed: 4.12 s, 368.739 Mflops
matrix<unbounded_array> fast
elapsed: 5.32 s, 285.565 Mflops
DOUBLE, 100
bench_3
prod (matrix, matrix)
C array
elapsed: 5.05 s, 375.804 Mflops
c_matrix safe
elapsed: 3.05 s, 622.233 Mflops
c_matrix fast
elapsed: 3.05 s, 622.233 Mflops
matrix<unbounded_array> safe
elapsed: 3.09 s, 614.179 Mflops
matrix<unbounded_array> fast
elapsed: 3.54 s, 536.105 Mflops
DOUBLE, 300
bench_3
prod (matrix, matrix)
C array
elapsed: 3.23 s, 477.516 Mflops
c_matrix safe
elapsed: 2.05 s, 752.379 Mflops
c_matrix fast
elapsed: 2.04 s, 756.067 Mflops
matrix<unbounded_array> safe
elapsed: 2.05 s, 752.379 Mflops
matrix<unbounded_array> fast
elapsed: 2.17 s, 710.773 Mflops
DOUBLE, 1000
bench_3
prod (matrix, matrix)
C array
elapsed: 4.83 s, 394.699 Mflops
c_matrix safe
elapsed: 2.37 s, 804.386 Mflops
c_matrix fast
elapsed: 2.37 s, 804.386 Mflops
matrix<unbounded_array> safe
elapsed: 2.39 s, 797.655 Mflops
matrix<unbounded_array> fast
elapsed: 2.44 s, 781.309 Mflops
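
For reference, the elapsed/Mflops lines above can be reproduced to a first
approximation with a plain timing loop such as the sketch below.  This is an
illustration, not part of the series: it assumes the conventional 2*n^3 flop
count for an n-by-n product and times a single run, whereas the real bench_3
harness iterates many times and differs in other details.

#include <boost/numeric/ublas/matrix.hpp>
#include <chrono>
#include <cstddef>
#include <cstdio>

int main()
{
    using namespace boost::numeric::ublas;

    const std::size_t n = 1000;              // one of the sizes reported above
    matrix<double> a(n, n), b(n, n), c(n, n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            a(i, j) = 1.0;
            b(i, j) = 2.0;
        }

    const auto t0 = std::chrono::steady_clock::now();
    noalias(c) = prod(a, b);                 // the dense product being benchmarked
    const std::chrono::duration<double> dt =
        std::chrono::steady_clock::now() - t0;

    // Mflops assuming 2*n^3 floating point operations per product.
    const double mflops = 2.0 * n * n * n / dt.count() / 1e6;
    std::printf("elapsed: %g s, %g Mflops\n", dt.count(), mflops);
    return 0;
}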


Imre Palik (3):
  ublas: improved dense matrix multiplication performance
  boost::ublas: gcc support for optimal matrix multiplication
  boost::ublas: increasing the range of BLAS level 3 benchmarks

 benchmarks/bench1/bench1.cpp                       |  14 +-
 benchmarks/bench1/bench13.cpp                      |   8 +
 benchmarks/bench3/bench3.cpp                       |  14 +-
 benchmarks/bench3/bench33.cpp                      |   8 +
 include/boost/numeric/ublas/detail/block_sizes.hpp |  78 +++++
 include/boost/numeric/ublas/detail/gemm.hpp        | 340 +++++++++++++++++++++
 include/boost/numeric/ublas/detail/vector.hpp      |  27 ++
 include/boost/numeric/ublas/matrix_expression.hpp  | 140 ++++++++-
 include/boost/numeric/ublas/operation.hpp          | 155 +++++++++-
 9 files changed, 757 insertions(+), 27 deletions(-)
 create mode 100644 include/boost/numeric/ublas/detail/block_sizes.hpp
 create mode 100644 include/boost/numeric/ublas/detail/gemm.hpp
 create mode 100644 include/boost/numeric/ublas/detail/vector.hpp

--
1.9.1


[PATCH 2/3] boost::ublas: gcc support for optimal matrix multiplication

palik imre
This patch adds an optimised, vectorised kernel based on gcc's
SIMD vector extensions.  It reaches matrix multiplication speeds
comparable to hand-crafted assembly kernels on the x86 Haswell
microarchitecture.

The patch also contains optimised kernel parameters for float,
double, and long double.
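
For readers not familiar with gcc's vector extensions, the snippet below
shows the scalar-times-vector broadcast pattern the new micro-kernel is
built around.  It is an illustration only, not code from the patch; the
32-byte vector width and the gcc >= 4.8 requirement are assumptions that
match the AVX/Haswell case the kernel targets.

#include <cstdio>

// Four doubles per 32-byte (AVX-sized) register.
typedef double v4df __attribute__((vector_size(32)));

int main()
{
    v4df acc = {0.0, 0.0, 0.0, 0.0};
    v4df b   = {1.0, 2.0, 3.0, 4.0};
    double a = 2.0;

    // The scalar operand is broadcast across the vector, the same pattern
    // as the micro-kernel's  P[i*NR+j] += A[i] * b[j]  update.
    acc += a * b;

    std::printf("%g %g %g %g\n", acc[0], acc[1], acc[2], acc[3]);
    return 0;
}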

Signed-off-by: Imre Palik <[hidden email]>
Cc: Michael Lehn <[hidden email]>
---
 include/boost/numeric/ublas/detail/block_sizes.hpp |  78 +++++++
 include/boost/numeric/ublas/detail/gemm.hpp        | 223 +++++++++++----------
 include/boost/numeric/ublas/detail/vector.hpp      |  27 +++
 include/boost/numeric/ublas/matrix_expression.hpp  |   1 +
 include/boost/numeric/ublas/operation.hpp          |   1 +
 5 files changed, 223 insertions(+), 107 deletions(-)
 create mode 100644 include/boost/numeric/ublas/detail/block_sizes.hpp
 create mode 100644 include/boost/numeric/ublas/detail/vector.hpp

diff --git a/include/boost/numeric/ublas/detail/block_sizes.hpp b/include/boost/numeric/ublas/detail/block_sizes.hpp
new file mode 100644
index 0000000..a146f10
--- /dev/null
+++ b/include/boost/numeric/ublas/detail/block_sizes.hpp
@@ -0,0 +1,78 @@
+//
+//  Copyright (c) 2016
+//  Michael Lehn, Imre Palik
+//
+//  Distributed under the Boost Software License, Version 1.0. (See
+//  accompanying file LICENSE_1_0.txt or copy at
+//  http://www.boost.org/LICENSE_1_0.txt)
+
+#ifndef _BOOST_UBLAS_BLOCK_SIZES_
+#define _BOOST_UBLAS_BLOCK_SIZES_
+
+#include <boost/numeric/ublas/detail/vector.hpp>
+
+namespace boost { namespace numeric { namespace ublas { namespace detail {
+
+    template <typename T>
+    struct prod_block_size {
+        static const unsigned vector_length =
+            _BOOST_UBLAS_VECTOR_SIZE > sizeof(T)? _BOOST_UBLAS_VECTOR_SIZE/sizeof(T) : 1; // Number of elements in a vector register
+        static const unsigned mc = 256;
+        static const unsigned kc = 512; // stripe length
+        static const unsigned nc = (4096/(3 * vector_length)) * (3 * vector_length);
+        static const unsigned mr = 4; // stripe width for lhs
+        static const unsigned nr = 3 * vector_length; // stripe width for rhs
+        static const unsigned align = 64; // align temporary arrays to this boundary
+        static const unsigned limit = 27; // Use gemm from this size
+        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
+        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
+        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
+    };
+
+    template <>
+    struct prod_block_size<float> {
+        static const unsigned mc = 256;
+        static const unsigned kc = 512; // stripe length
+        static const unsigned nc = 4096;
+        static const unsigned mr = 4; // stripe width for lhs
+        static const unsigned nr = 16; // stripe width for rhs
+        static const unsigned align = 64; // align temporary arrays to this boundary
+        static const unsigned limit = 27; // Use gemm from this size
+        static const unsigned vector_length = _BOOST_UBLAS_VECTOR_SIZE/sizeof(float); // Number of elements in a vector register
+        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
+        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
+        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
+    };
+
+    template <>
+    struct prod_block_size<long double> {
+        static const unsigned mc = 256;
+        static const unsigned kc = 512; // stripe length
+        static const unsigned nc = 4096;
+        static const unsigned mr = 1; // stripe width for lhs
+        static const unsigned nr = 4; // stripe width for rhs
+        static const unsigned align = 64; // align temporary arrays to this boundary
+        static const unsigned limit = 71; // Use gemm from this size
+        static const unsigned vector_length = 1; // Number of elements in a vector register
+        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
+        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
+        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
+    };
+
+    template<typename T>
+    struct is_blocksize {
+        struct fallback { static const int nr = 0; };
+        struct derived : T, fallback {};
+        template<int C1>
+        struct nonr {
+            static const bool value = false;
+            typedef false_type type;
+        };
+
+         template<typename C> static char (&f(typename nonr<C::nr>::type*))[1];
+         template<typename C> static char (&f(...))[2];
+
+         static bool const value = sizeof(f<derived>(0)) == 2;
+    };
+}}}}
+#endif
diff --git a/include/boost/numeric/ublas/detail/gemm.hpp b/include/boost/numeric/ublas/detail/gemm.hpp
index f66e80f..7fd0ab9 100644
--- a/include/boost/numeric/ublas/detail/gemm.hpp
+++ b/include/boost/numeric/ublas/detail/gemm.hpp
@@ -10,80 +10,26 @@
 #define _BOOST_UBLAS_GEMM_
 
 #include <boost/type_traits/common_type.hpp>
+#include <boost/type_traits/aligned_storage.hpp>
+#include <boost/type_traits/is_arithmetic.hpp>
 #include <boost/align/aligned_alloc.hpp>
 #include <boost/align/assume_aligned.hpp>
 #include <boost/static_assert.hpp>
 #include <boost/numeric/ublas/matrix_proxy.hpp>
+#include <boost/numeric/ublas/detail/vector.hpp>
+#include <boost/predef/compiler.h>
 
 namespace boost { namespace numeric { namespace ublas { namespace detail {
 
-    template <typename T>
-    struct prod_block_size {
-        static const unsigned mc = 256;
-        static const unsigned kc = 512; // stripe length
-        static const unsigned nc = 4092;
-        static const unsigned mr = 4; // stripe width for lhs
-        static const unsigned nr = 12; // stripe width for rhs
-        static const unsigned align = 64; // align temporary arrays to this boundary
-        static const unsigned limit = 26; // Use gemm from this size
-        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
-        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
-        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
-    };
-
-    template <>
-    struct prod_block_size<float> {
-        static const unsigned mc = 256;
-        static const unsigned kc = 512; // stripe length
-        static const unsigned nc = 4080;
-        static const unsigned mr = 4; // stripe width for lhs
-        static const unsigned nr = 24; // stripe width for rhs
-        static const unsigned align = 64; // align temporary arrays to this boundary
-        static const unsigned limit = 26; // Use gemm from this size
-        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
-        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
-        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
-    };
-
-    template <>
-    struct prod_block_size<long double> {
-        static const unsigned mc = 256;
-        static const unsigned kc = 512; // stripe length
-        static const unsigned nc = 4096;
-        static const unsigned mr = 2; // stripe width for lhs
-        static const unsigned nr = 2; // stripe width for rhs
-        static const unsigned align = 64; // align temporary arrays to this boundary
-        static const unsigned limit = 26; // Use gemm from this size
-        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
-        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
-        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
-    };
-
-    template<typename T>
-    struct is_blocksize {
-        struct fallback { static const int nr = 0; };
-        struct derived : T, fallback {};
-        template<int C1>
-       struct nonr {
-           static const bool value = false;
-           typedef false_type type;
-       };
-
-         template<typename C> static char (&f(typename nonr<C::nr>::type*))[1];
-         template<typename C> static char (&f(...))[2];
-
-         static bool const value = sizeof(f<derived>(0)) == 2;
-    };
-
     template <typename E>
     void
     gescal(const typename E::value_type &alpha, matrix_expression<E> &X)
     {
         typedef typename E::size_type  size_type;
 
-       for (size_type i=0; i<X().size1(); ++i) {
+        for (size_type i=0; i<X().size1(); ++i) {
             for (size_type j=0; j<X().size2(); ++j) {
-               X()(i,j) *= alpha;
+                X()(i,j) *= alpha;
             }
         }
     }
@@ -114,9 +60,73 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
         }
     }
 
-    //-- Micro Kernel --------------------------------------------------------------
-    template <typename Index, typename T, typename TC, typename BlockSize>
+    //-- Micro Kernel ----------------------------------------------------------
+#if defined(BOOST_COMP_GNUC_DETECTION) \
+    && BOOST_COMP_GNUC_DETECTION >= BOOST_VERSION_NUMBER(4,8,0)
+    template <typename Index, typename T, typename TC,
+              typename BlockSize>
+    typename enable_if_c<is_arithmetic<T>::value
+                         && (1 < BlockSize::vector_length),
+                         void>::type
+    ugemm(Index kc, TC alpha, const T *A, const T *B,
+          TC beta, TC *C, Index incRowC, Index incColC)
+    {
+        BOOST_ALIGN_ASSUME_ALIGNED (A, BlockSize::align);
+        BOOST_ALIGN_ASSUME_ALIGNED (B, BlockSize::align);
+        static const unsigned vector_length = BlockSize::vector_length;
+        static const Index MR = BlockSize::mr;
+        static const Index NR = BlockSize::nr/vector_length;
+
+        typedef T vx __attribute__((vector_size (_BOOST_UBLAS_VECTOR_SIZE)));
+
+        vx P[MR*NR] = {};
+
+        const vx *b = (const vx *)B;
+        for (Index l=0; l<kc; ++l) {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR; ++j) {
+                    P[i*NR+j] += A[i]*b[j];
+                }
+            }
+            A += MR;
+            b += NR;
+        }
+
+        if (alpha!=TC(1)) {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR; ++j) {
+                    P[i*NR+j] *= alpha;
+                }
+            }
+        }
+
+        const T *p = (const T *) P;
+        if (beta == TC(0)) {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR * vector_length; ++j) {
+                    C[i*incRowC+j*incColC] = p[i*NR * vector_length +j];
+                }
+            }
+        } else {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR * vector_length; ++j) {
+                    C[i*incRowC+j*incColC] *= beta;
+                    C[i*incRowC+j*incColC] += p[i*NR * vector_length+j];
+                }
+            }
+        }
+    }
+
+    template <typename Index, typename T, typename TC,
+              typename BlockSize>
+    typename enable_if_c<!is_arithmetic<T>::value
+                         || (BlockSize::vector_length <= 1),
+                         void>::type
+#else
+    template <typename Index, typename T, typename TC,
+              typename BlockSize>
     void
+#endif
     ugemm(Index kc, TC alpha, const T *A, const T *B,
           TC beta, TC *C, Index incRowC, Index incColC)
     {
@@ -124,47 +134,47 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
         BOOST_ALIGN_ASSUME_ALIGNED(B, BlockSize::align);
         static const Index MR = BlockSize::mr;
         static const Index NR = BlockSize::nr;
-       typename boost::aligned_storage<sizeof(T[MR*NR]),BlockSize::align>::type Pa;
-       T *P = reinterpret_cast<T*>(Pa.address());
-       for (unsigned c = 0; c < MR * NR; c++)
-         P[c] = 0;
+        typename aligned_storage<sizeof(T[MR*NR]),BlockSize::align>::type Pa;
+        T *P = reinterpret_cast<T*>(Pa.address());
+        for (unsigned c = 0; c < MR * NR; c++)
+          P[c] = 0;
 
         for (Index l=0; l<kc; ++l) {
-           for (Index i=0; i<MR; ++i) {
+            for (Index i=0; i<MR; ++i) {
               for (Index j=0; j<NR; ++j) {
                     P[i* NR+j] += A[i]*B[j];
                 }
             }
-           A += MR;
-           B += NR;
+            A += MR;
+            B += NR;
+        }
+
+        if (alpha!=TC(1)) {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR; ++j) {
+                    P[i*NR+j] *= alpha;
+                }
+            }
         }
 
-       if (alpha!=TC(1)) {
-           for (Index i=0; i<MR; ++i) {
-               for (Index j=0; j<NR; ++j) {
-                   P[i*NR+j] *= alpha;
-               }
-           }
-       }
-
-       if (beta == TC(0)) {
-           for (Index i=0; i<MR; ++i) {
-               for (Index j=0; j<NR; ++j) {
+        if (beta == TC(0)) {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR; ++j) {
                 C[i*incRowC+j*incColC] = P[i*NR+j];
-               }
-           }
-       } else {
-           for (Index i=0; i<MR; ++i) {
-               for (Index j=0; j<NR; ++j) {
+                }
+            }
+        } else {
+            for (Index i=0; i<MR; ++i) {
+                for (Index j=0; j<NR; ++j) {
                     C[i*incRowC+j*incColC] *= beta;
                     C[i*incRowC+j*incColC] += P[i*NR+j];
-               }
-           }
-       }
+                }
+            }
+        }
     }
 
-    //-- Macro Kernel --------------------------------------------------------------
-       template <typename Index, typename T, typename TC, typename BlockSize>
+    //-- Macro Kernel ----------------------------------------------------------
+        template <typename Index, typename T, typename TC, typename BlockSize>
     void
     mgemm(Index mc, Index nc, Index kc, TC alpha,
           const T *A, const T *B, TC beta,
@@ -181,9 +191,6 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
         // #pragma omp parallel for
         // #endif
         for (Index j=0; j<np; ++j) {
-      // __builtin_prefetch(B + j * kc * NR, 0);
-      // __builtin_prefetch(B + j * kc * NR + 8, 0);
-      // __builtin_prefetch(B + j * kc * NR + 16, 0);
             const Index nr = (j!=np-1 || nr_==0) ? NR : nr_;
             T C_[MR*NR];
 
@@ -191,7 +198,7 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
                 const Index mr = (i!=mp-1 || mr_==0) ? MR : mr_;
 
                 if (mr==MR && nr==NR) {
-                 ugemm<Index, T, TC, BlockSize>(kc, alpha,
+                  ugemm<Index, T, TC, BlockSize>(kc, alpha,
                           &A[i*kc*MR], &B[j*kc*NR],
                           beta,
                           &C[i*MR*incRowC+j*NR*incColC],
@@ -214,11 +221,12 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
     }
 
     //-- Packing blocks ------------------------------------------------------------
-       template <typename E, typename T, typename BlockSize>
+        template <typename E, typename T, typename BlockSize>
     void
     pack_A(const matrix_expression<E> &A, T *p)
     {
         typedef typename E::size_type  size_type;
+        BOOST_ALIGN_ASSUME_ALIGNED(p, BlockSize::align);
 
         const size_type mc = A ().size1();
         const size_type kc = A ().size2();
@@ -236,11 +244,12 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
         }
     }
 
-       template <typename E, typename T, typename BlockSize>
+        template <typename E, typename T, typename BlockSize>
     void
     pack_B(const matrix_expression<E> &B, T *p)
     {
         typedef typename E::size_type  size_type;
+        BOOST_ALIGN_ASSUME_ALIGNED(p, BlockSize::align);
 
         const size_type kc = B ().size1();
         const size_type nc = B ().size2();
@@ -262,9 +271,9 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
     template <typename E1, typename E2, typename E3, typename BlockSize>
     void
     gemm(typename E3::value_type alpha, const matrix_expression<E1> &e1,
-        const matrix_expression<E2> &e2,
+         const matrix_expression<E2> &e2,
          typename E3::value_type beta, matrix_expression<E3> &e3,
-        BlockSize)
+         BlockSize)
     {
         typedef typename E3::size_type  size_type;
         typedef typename E1::value_type value_type1;
@@ -276,7 +285,7 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
         static const size_type NC = BlockSize::nc;
 
         const size_type m = BOOST_UBLAS_SAME (e3 ().size1 (), e1 ().size1 ());
-       const size_type n = BOOST_UBLAS_SAME (e3 ().size2 (), e2 ().size2 ());
+        const size_type n = BOOST_UBLAS_SAME (e3 ().size2 (), e2 ().size2 ());
         const size_type k = BOOST_UBLAS_SAME (e1 ().size2 (), e2 ().size1 ());
 
         static const size_type KC = BlockSize::kc;
@@ -291,11 +300,11 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
         const size_type incRowC = &e3()(1,0) - &e3()(0,0);
         const size_type incColC = &e3()(0,1) - &e3()(0,0);
         value_type_i *A =
-            static_cast<value_type_i *>(boost::alignment::aligned_alloc(BlockSize::align,
-                                                                       sizeof(value_type_i) * MC*KC));
+            static_cast<value_type_i *>(::boost::alignment::aligned_alloc(BlockSize::align,
+                                                                        sizeof(value_type_i) * MC*KC));
         value_type_i *B =
-            static_cast<value_type_i *>(boost::alignment::aligned_alloc(BlockSize::align,
-                                                                     sizeof(value_type_i) * KC*NC));
+            static_cast<value_type_i *>(::boost::alignment::aligned_alloc(BlockSize::align,
+                                                                      sizeof(value_type_i) * KC*NC));
 
         if (alpha==value_type3(0) || k==0) {
             gescal(beta, e3);
@@ -319,13 +328,13 @@ namespace boost { namespace numeric { namespace ublas { namespace detail {
                     pack_A<matrix_range<const E1>, value_type_i, BlockSize>(As, A);
 
                     mgemm<size_type, value_type_i, value_type3, BlockSize>(mc, nc, kc, alpha, A, B, beta_,
-                         &C_[i*MC*incRowC+j*NC*incColC],
-                         incRowC, incColC);
+                          &C_[i*MC*incRowC+j*NC*incColC],
+                          incRowC, incColC);
                 }
             }
         }
-       boost::alignment::aligned_free(A);
-       boost::alignment::aligned_free(B);
+        ::boost::alignment::aligned_free(A);
+        ::boost::alignment::aligned_free(B);
     }
 }}}}
 #endif
diff --git a/include/boost/numeric/ublas/detail/vector.hpp b/include/boost/numeric/ublas/detail/vector.hpp
new file mode 100644
index 0000000..1681eda
--- /dev/null
+++ b/include/boost/numeric/ublas/detail/vector.hpp
@@ -0,0 +1,27 @@
+//
+//  Copyright (c) 2016
+//  Imre Palik
+//
+//  Distributed under the Boost Software License, Version 1.0. (See
+//  accompanying file LICENSE_1_0.txt or copy at
+//  http://www.boost.org/LICENSE_1_0.txt)
+
+#ifndef _BOOST_UBLAS_DETAIL_VECTOR_
+#define _BOOST_UBLAS_DETAIL_VECTOR_
+#include <boost/predef/hardware/simd.h>
+
+#if defined(BOOST_HW_SIMD) && (BOOST_HW_SIMD == BOOST_HW_SIMD_X86 || BOOST_HW_SIMD == BOOST_HW_SIMD_X86_AMD)
+#if BOOST_HW_SIMD >= BOOST_HW_SIMD_X86_AVX_VERSION
+#define _BOOST_UBLAS_VECTOR_SIZE 32
+#elif BOOST_HW_SIMD >= BOOST_HW_SIMD_X86_SSE_VERSION
+#define _BOOST_UBLAS_VECTOR_SIZE 16
+#else
+#define _BOOST_UBLAS_VECTOR_SIZE 8
+#endif
+#endif
+
+#ifndef _BOOST_UBLAS_VECTOR_SIZE
+#define _BOOST_UBLAS_VECTOR_SIZE 1
+#endif
+
+#endif
diff --git a/include/boost/numeric/ublas/matrix_expression.hpp b/include/boost/numeric/ublas/matrix_expression.hpp
index 78daf2b..c4e0ac8 100644
--- a/include/boost/numeric/ublas/matrix_expression.hpp
+++ b/include/boost/numeric/ublas/matrix_expression.hpp
@@ -15,6 +15,7 @@
 
 #include <boost/numeric/ublas/vector_expression.hpp>
 #include <boost/numeric/ublas/detail/gemm.hpp>
+#include <boost/numeric/ublas/detail/block_sizes.hpp>
 
 // Expression templates based on ideas of Todd Veldhuizen and Geoffrey Furnish
 // Iterators based on ideas of Jeremy Siek
diff --git a/include/boost/numeric/ublas/operation.hpp b/include/boost/numeric/ublas/operation.hpp
index 393a9bd..8fd3b63 100644
--- a/include/boost/numeric/ublas/operation.hpp
+++ b/include/boost/numeric/ublas/operation.hpp
@@ -15,6 +15,7 @@
 
 #include <boost/numeric/ublas/matrix_proxy.hpp>
 #include <boost/numeric/ublas/detail/gemm.hpp>
+#include <boost/numeric/ublas/detail/block_sizes.hpp>
 /** \file operation.hpp
  *  \brief This file contains some specialized products.
  */
--
1.9.1


[PATCH 1/3] ublas: improved dense matrix multiplication performance

palik imre
This patch brings the gemm implementation from Michael Lehn into
boost::ublas.

It modifies ublas::prod() and ublas::axpy_prod() to use gemm()
above a certain matrix size.

This patch only contains the basic architecture and a generic C++
implementation.
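
The size-based dispatch can be pictured with the simplified, self-contained
sketch below.  The threshold constant and the naive fallback loop are just
stand-ins for prod_block_size<>::limit and the blocked detail::gemm() the
patch actually calls; they are not the patch's code.

#include <boost/numeric/ublas/matrix.hpp>
#include <cstddef>

boost::numeric::ublas::matrix<double>
multiply(const boost::numeric::ublas::matrix<double> &a,
         const boost::numeric::ublas::matrix<double> &b)
{
    using namespace boost::numeric::ublas;

    const std::size_t limit = 26;          // assumed cut-over size
    matrix<double> c(a.size1(), b.size2());

    if (a.size1() < limit || b.size2() < limit) {
        // Small operands: keep the classic expression-template path.
        noalias(c) = prod(a, b);
    } else {
        // Large operands: stand-in for the cache-blocked gemm() kernel.
        for (std::size_t i = 0; i < a.size1(); ++i)
            for (std::size_t j = 0; j < b.size2(); ++j) {
                double s = 0.0;
                for (std::size_t k = 0; k < a.size2(); ++k)
                    s += a(i, k) * b(k, j);
                c(i, j) = s;
            }
    }
    return c;
}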

Signed-off-by: Imre Palik <[hidden email]>
Cc: Michael Lehn <[hidden email]>
---
 include/boost/numeric/ublas/detail/gemm.hpp       | 331 ++++++++++++++++++++++
 include/boost/numeric/ublas/matrix_expression.hpp | 139 ++++++++-
 include/boost/numeric/ublas/operation.hpp         | 154 +++++++++-
 3 files changed, 605 insertions(+), 19 deletions(-)
 create mode 100644 include/boost/numeric/ublas/detail/gemm.hpp

diff --git a/include/boost/numeric/ublas/detail/gemm.hpp b/include/boost/numeric/ublas/detail/gemm.hpp
new file mode 100644
index 0000000..f66e80f
--- /dev/null
+++ b/include/boost/numeric/ublas/detail/gemm.hpp
@@ -0,0 +1,331 @@
+//
+//  Copyright (c) 2016
+//  Michael Lehn, Imre Palik
+//
+//  Distributed under the Boost Software License, Version 1.0. (See
+//  accompanying file LICENSE_1_0.txt or copy at
+//  http://www.boost.org/LICENSE_1_0.txt)
+
+#ifndef _BOOST_UBLAS_GEMM_
+#define _BOOST_UBLAS_GEMM_
+
+#include <boost/type_traits/common_type.hpp>
+#include <boost/align/aligned_alloc.hpp>
+#include <boost/align/assume_aligned.hpp>
+#include <boost/static_assert.hpp>
+#include <boost/numeric/ublas/matrix_proxy.hpp>
+
+namespace boost { namespace numeric { namespace ublas { namespace detail {
+
+    template <typename T>
+    struct prod_block_size {
+        static const unsigned mc = 256;
+        static const unsigned kc = 512; // stripe length
+        static const unsigned nc = 4092;
+        static const unsigned mr = 4; // stripe width for lhs
+        static const unsigned nr = 12; // stripe width for rhs
+        static const unsigned align = 64; // align temporary arrays to this boundary
+        static const unsigned limit = 26; // Use gemm from this size
+        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
+        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
+        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
+    };
+
+    template <>
+    struct prod_block_size<float> {
+        static const unsigned mc = 256;
+        static const unsigned kc = 512; // stripe length
+        static const unsigned nc = 4080;
+        static const unsigned mr = 4; // stripe width for lhs
+        static const unsigned nr = 24; // stripe width for rhs
+        static const unsigned align = 64; // align temporary arrays to this boundary
+        static const unsigned limit = 26; // Use gemm from this size
+        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
+        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
+        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
+    };
+
+    template <>
+    struct prod_block_size<long double> {
+        static const unsigned mc = 256;
+        static const unsigned kc = 512; // stripe length
+        static const unsigned nc = 4096;
+        static const unsigned mr = 2; // stripe width for lhs
+        static const unsigned nr = 2; // stripe width for rhs
+        static const unsigned align = 64; // align temporary arrays to this boundary
+        static const unsigned limit = 26; // Use gemm from this size
+        BOOST_STATIC_ASSERT_MSG(mc>0 && kc>0 && nc>0 && mr>0 && nr>0, "Invalid block size.");
+        BOOST_STATIC_ASSERT_MSG(mc % mr == 0, "MC must be a multiple of MR.");
+        BOOST_STATIC_ASSERT_MSG(nc % nr == 0, "NC must be a multiple of NR.");
+    };
+
+    template<typename T>
+    struct is_blocksize {
+        struct fallback { static const int nr = 0; };
+        struct derived : T, fallback {};
+        template<int C1>
+       struct nonr {
+           static const bool value = false;
+           typedef false_type type;
+       };
+
+         template<typename C> static char (&f(typename nonr<C::nr>::type*))[1];
+         template<typename C> static char (&f(...))[2];
+
+         static bool const value = sizeof(f<derived>(0)) == 2;
+    };
+
+    template <typename E>
+    void
+    gescal(const typename E::value_type &alpha, matrix_expression<E> &X)
+    {
+        typedef typename E::size_type  size_type;
+
+       for (size_type i=0; i<X().size1(); ++i) {
+            for (size_type j=0; j<X().size2(); ++j) {
+               X()(i,j) *= alpha;
+            }
+        }
+    }
+
+    template <typename Index, typename T>
+    void
+    geaxpy(Index m, Index n, const T &alpha,
+           const T *X, Index incRowX, Index incColX,
+           T       *Y, Index incRowY, Index incColY)
+    {
+        for (Index j=0; j<n; ++j) {
+            for (Index i=0; i<m; ++i) {
+                Y[i*incRowY+j*incColY] += alpha*X[i*incRowX+j*incColX];
+            }
+        }
+    }
+
+    template <typename Index, typename TX>
+    void
+    gescal(Index m, Index n,
+           const TX &alpha,
+           TX *X, Index incRowX, Index incColX)
+    {
+        for (Index j=0; j<n; ++j) {
+            for (Index i=0; i<m; ++i) {
+                X[i*incRowX+j*incColX] *= alpha;
+            }
+        }
+    }
+
+    //-- Micro Kernel --------------------------------------------------------------
+    template <typename Index, typename T, typename TC, typename BlockSize>
+    void
+    ugemm(Index kc, TC alpha, const T *A, const T *B,
+          TC beta, TC *C, Index incRowC, Index incColC)
+    {
+        BOOST_ALIGN_ASSUME_ALIGNED(A, BlockSize::align);
+        BOOST_ALIGN_ASSUME_ALIGNED(B, BlockSize::align);
+        static const Index MR = BlockSize::mr;
+        static const Index NR = BlockSize::nr;
+       typename boost::aligned_storage<sizeof(T[MR*NR]),BlockSize::align>::type Pa;
+       T *P = reinterpret_cast<T*>(Pa.address());
+       for (unsigned c = 0; c < MR * NR; c++)
+         P[c] = 0;
+
+        for (Index l=0; l<kc; ++l) {
+           for (Index i=0; i<MR; ++i) {
+              for (Index j=0; j<NR; ++j) {
+                    P[i* NR+j] += A[i]*B[j];
+                }
+            }
+           A += MR;
+           B += NR;
+        }
+
+       if (alpha!=TC(1)) {
+           for (Index i=0; i<MR; ++i) {
+               for (Index j=0; j<NR; ++j) {
+                   P[i*NR+j] *= alpha;
+               }
+           }
+       }
+
+       if (beta == TC(0)) {
+           for (Index i=0; i<MR; ++i) {
+               for (Index j=0; j<NR; ++j) {
+                C[i*incRowC+j*incColC] = P[i*NR+j];
+               }
+           }
+       } else {
+           for (Index i=0; i<MR; ++i) {
+               for (Index j=0; j<NR; ++j) {
+                    C[i*incRowC+j*incColC] *= beta;
+                    C[i*incRowC+j*incColC] += P[i*NR+j];
+               }
+           }
+       }
+    }
+
+    //-- Macro Kernel --------------------------------------------------------------
+       template <typename Index, typename T, typename TC, typename BlockSize>
+    void
+    mgemm(Index mc, Index nc, Index kc, TC alpha,
+          const T *A, const T *B, TC beta,
+          TC *C, Index incRowC, Index incColC)
+    {
+        static const Index MR = BlockSize::mr;
+        static const Index NR = BlockSize::nr;
+        const Index mp  = (mc+MR-1) / MR;
+        const Index np  = (nc+NR-1) / NR;
+        const Index mr_ = mc % MR;
+        const Index nr_ = nc % NR;
+
+        // #if defined(_OPENMP)
+        // #pragma omp parallel for
+        // #endif
+        for (Index j=0; j<np; ++j) {
+      // __builtin_prefetch(B + j * kc * NR, 0);
+      // __builtin_prefetch(B + j * kc * NR + 8, 0);
+      // __builtin_prefetch(B + j * kc * NR + 16, 0);
+            const Index nr = (j!=np-1 || nr_==0) ? NR : nr_;
+            T C_[MR*NR];
+
+            for (Index i=0; i<mp; ++i) {
+                const Index mr = (i!=mp-1 || mr_==0) ? MR : mr_;
+
+                if (mr==MR && nr==NR) {
+                 ugemm<Index, T, TC, BlockSize>(kc, alpha,
+                          &A[i*kc*MR], &B[j*kc*NR],
+                          beta,
+                          &C[i*MR*incRowC+j*NR*incColC],
+                          incRowC, incColC);
+                } else {
+                    std::fill_n(C_, MR*NR, T(0));
+                    ugemm<Index, T, TC, BlockSize>(kc, alpha,
+                          &A[i*kc*MR], &B[j*kc*NR],
+                          T(0),
+                          C_, NR, Index(1));
+                    gescal(mr, nr, beta,
+                           &C[i*MR*incRowC+j*NR*incColC],
+                           incRowC, incColC);
+                    geaxpy(mr, nr, T(1), C_, NR, Index(1),
+                           &C[i*MR*incRowC+j*NR*incColC],
+                           incRowC, incColC);
+                }
+            }
+        }
+    }
+
+    //-- Packing blocks ------------------------------------------------------------
+       template <typename E, typename T, typename BlockSize>
+    void
+    pack_A(const matrix_expression<E> &A, T *p)
+    {
+        typedef typename E::size_type  size_type;
+
+        const size_type mc = A ().size1();
+        const size_type kc = A ().size2();
+        static const size_type MR = BlockSize::mr;
+        const size_type mp = (mc+MR-1) / MR;
+
+        for (size_type j=0; j<kc; ++j) {
+            for (size_type l=0; l<mp; ++l) {
+                for (size_type i0=0; i0<MR; ++i0) {
+                    size_type i  = l*MR + i0;
+                    size_type nu = l*MR*kc + j*MR + i0;
+                    p[nu]        = (i<mc) ? A()(i,j) : T(0);
+                }
+            }
+        }
+    }
+
+       template <typename E, typename T, typename BlockSize>
+    void
+    pack_B(const matrix_expression<E> &B, T *p)
+    {
+        typedef typename E::size_type  size_type;
+
+        const size_type kc = B ().size1();
+        const size_type nc = B ().size2();
+        static const size_type NR = BlockSize::nr;
+        const size_type np = (nc+NR-1) / NR;
+
+        for (size_type l=0; l<np; ++l) {
+            for (size_type j0=0; j0<NR; ++j0) {
+                for (size_type i=0; i<kc; ++i) {
+                    size_type j  = l*NR+j0;
+                    size_type nu = l*NR*kc + i*NR + j0;
+                    p[nu]        = (j<nc) ? B()(i,j) : T(0);
+                }
+            }
+        }
+    }
+
+    //-- Frame routine -------------------------------------------------------------
+    template <typename E1, typename E2, typename E3, typename BlockSize>
+    void
+    gemm(typename E3::value_type alpha, const matrix_expression<E1> &e1,
+        const matrix_expression<E2> &e2,
+         typename E3::value_type beta, matrix_expression<E3> &e3,
+        BlockSize)
+    {
+        typedef typename E3::size_type  size_type;
+        typedef typename E1::value_type value_type1;
+        typedef typename E2::value_type value_type2;
+        typedef typename E3::value_type value_type3;
+        typedef typename common_type<value_type1, value_type2>::type value_type_i;
+
+        static const size_type MC = BlockSize::mc;
+        static const size_type NC = BlockSize::nc;
+
+        const size_type m = BOOST_UBLAS_SAME (e3 ().size1 (), e1 ().size1 ());
+       const size_type n = BOOST_UBLAS_SAME (e3 ().size2 (), e2 ().size2 ());
+        const size_type k = BOOST_UBLAS_SAME (e1 ().size2 (), e2 ().size1 ());
+
+        static const size_type KC = BlockSize::kc;
+        const size_type mb = (m+MC-1) / MC;
+        const size_type nb = (n+NC-1) / NC;
+        const size_type kb = (k+KC-1) / KC;
+        const size_type mc_ = m % MC;
+        const size_type nc_ = n % NC;
+        const size_type kc_ = k % KC;
+
+        value_type3 *C_ = &e3()(0,0);
+        const size_type incRowC = &e3()(1,0) - &e3()(0,0);
+        const size_type incColC = &e3()(0,1) - &e3()(0,0);
+        value_type_i *A =
+            static_cast<value_type_i *>(boost::alignment::aligned_alloc(BlockSize::align,
+                                                                       sizeof(value_type_i) * MC*KC));
+        value_type_i *B =
+            static_cast<value_type_i *>(boost::alignment::aligned_alloc(BlockSize::align,
+                                                                     sizeof(value_type_i) * KC*NC));
+
+        if (alpha==value_type3(0) || k==0) {
+            gescal(beta, e3);
+            return;
+        }
+
+        for (size_type j=0; j<nb; ++j) {
+            size_type nc = (j!=nb-1 || nc_==0) ? NC : nc_;
+
+            for (size_type l=0; l<kb; ++l) {
+                size_type kc = (l!=kb-1 || kc_==0) ? KC : kc_;
+                value_type3 beta_ = (l==0) ? beta : value_type3(1);
+
+                const matrix_range<const E2> Bs = subrange(e2(), l*KC, l*KC+kc, j*NC, j*NC+nc);
+                pack_B<matrix_range<const E2>, value_type_i, BlockSize>(Bs, B);
+
+                for (size_type i=0; i<mb; ++i) {
+                    size_type mc = (i!=mb-1 || mc_==0) ? MC : mc_;
+
+                    const matrix_range<const E1> As = subrange(e1(), i*MC, i*MC+mc, l*KC, l*KC+kc);
+                    pack_A<matrix_range<const E1>, value_type_i, BlockSize>(As, A);
+
+                    mgemm<size_type, value_type_i, value_type3, BlockSize>(mc, nc, kc, alpha, A, B, beta_,
+                         &C_[i*MC*incRowC+j*NC*incColC],
+                         incRowC, incColC);
+                }
+            }
+        }
+       boost::alignment::aligned_free(A);
+       boost::alignment::aligned_free(B);
+    }
+}}}}
+#endif
diff --git a/include/boost/numeric/ublas/matrix_expression.hpp b/include/boost/numeric/ublas/matrix_expression.hpp
index a363130..78daf2b 100644
--- a/include/boost/numeric/ublas/matrix_expression.hpp
+++ b/include/boost/numeric/ublas/matrix_expression.hpp
@@ -14,6 +14,7 @@
 #define _BOOST_UBLAS_MATRIX_EXPRESSION_
 
 #include <boost/numeric/ublas/vector_expression.hpp>
+#include <boost/numeric/ublas/detail/gemm.hpp>
 
 // Expression templates based on ideas of Todd Veldhuizen and Geoffrey Furnish
 // Iterators based on ideas of Jeremy Siek
@@ -5460,45 +5461,104 @@ namespace boost { namespace numeric { namespace ublas {
         expression2_closure_type e2_;
     };
 
+    namespace detail {
+      template<class E1, class E2, class P, bool s>
+      struct binary_calculate_result_type;
+
+      template<class E1, class E2, class P>
+      struct binary_calculate_result_type<E1, E2, P, false> {
+       typedef matrix_matrix_binary<E1, E2, matrix_matrix_prod<E1, E2, P> > result_type;
+      };
+
+      template<class E1, class E2, class P>
+      struct binary_calculate_result_type<E1, E2, P, true> {
+       typedef matrix<P> result_type;
+      };
+    }
+
     template<class T1, class E1, class T2, class E2>
     struct matrix_matrix_binary_traits {
-        typedef unknown_storage_tag storage_category;
+      //        typedef unknown_storage_tag storage_category;
+      typedef typename storage_restrict_traits<typename E1::storage_category, typename E2::storage_category>::storage_category storage_category;
         typedef unknown_orientation_tag orientation_category;
         typedef typename promote_traits<T1, T2>::promote_type promote_type;
         typedef matrix_matrix_binary<E1, E2, matrix_matrix_prod<E1, E2, promote_type> > expression_type;
 #ifndef BOOST_UBLAS_SIMPLE_ET_DEBUG
-        typedef expression_type result_type;
+      //        typedef expression_type result_type;
+      typedef typename detail::binary_calculate_result_type<E1, E2, promote_type, boost::is_base_of<dense_proxy_tag, storage_category>::value>::result_type result_type;
 #else
         typedef typename E1::matrix_temporary_type result_type;
 #endif
     };
 
-    template<class E1, class E2>
+    template<class E1, class E2, class B>
     BOOST_UBLAS_INLINE
     typename matrix_matrix_binary_traits<typename E1::value_type, E1,
                                          typename E2::value_type, E2>::result_type
     prod (const matrix_expression<E1> &e1,
           const matrix_expression<E2> &e2,
+         B,
           unknown_storage_tag,
           unknown_orientation_tag) {
+
         typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
                                                      typename E2::value_type, E2>::expression_type expression_type;
         return expression_type (e1 (), e2 ());
     }
 
-    // Dispatcher
-    template<class E1, class E2>
+    template<class E1, class E2, class B>
     BOOST_UBLAS_INLINE
     typename matrix_matrix_binary_traits<typename E1::value_type, E1,
                                          typename E2::value_type, E2>::result_type
     prod (const matrix_expression<E1> &e1,
-          const matrix_expression<E2> &e2) {
+          const matrix_expression<E2> &e2,
+         B b,
+          dense_proxy_tag,
+          unknown_orientation_tag) {
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::expression_type expression_type;
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::result_type result_type;
+       typedef typename result_type::value_type result_value;
+
+       if (e1 ().size1() < B::limit || e2 ().size2() < B::limit) {
+           return expression_type (e1 (), e2 ());
+       } else {
+            result_type rv(e1 ().size1(), e2 ().size2());
+           detail::gemm(result_value(1), e1, e2,
+                        result_value(0), rv, b);
+           return rv;
+       }
+    }
+
+    // Dispatcher
+    template<class E1, class E2, class B>
+    BOOST_UBLAS_INLINE
+    typename enable_if_c<detail::is_blocksize<B>::value,
+                        typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                             typename E2::value_type, E2>::result_type>::type
+    prod (const matrix_expression<E1> &e1,
+          const matrix_expression<E2> &e2,
+         B b) {
         BOOST_STATIC_ASSERT (E1::complexity == 0 && E2::complexity == 0);
         typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
                                                      typename E2::value_type, E2>::storage_category storage_category;
         typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
                                                      typename E2::value_type, E2>::orientation_category orientation_category;
-        return prod (e1, e2, storage_category (), orientation_category ());
+        return prod (e1, e2, b, storage_category (), orientation_category ());
+    }
+
+    template<class E1, class E2>
+    BOOST_UBLAS_INLINE
+    typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                         typename E2::value_type, E2>::result_type
+    prod (const matrix_expression<E1> &e1,
+          const matrix_expression<E2> &e2) {
+        BOOST_STATIC_ASSERT (E1::complexity == 0 && E2::complexity == 0);
+       typedef typename detail::prod_block_size<typename common_type<typename E1::value_type,
+                                                                     typename E2::value_type>::type> block_sizes;
+
+        return prod (e1, e2, block_sizes());
     }
 
     template<class E1, class E2>
@@ -5529,13 +5589,74 @@ namespace boost { namespace numeric { namespace ublas {
         return prec_prod (e1, e2, storage_category (), orientation_category ());
     }
 
-    template<class M, class E1, class E2>
+    template<class M, class E1, class E2, typename B>
+    BOOST_UBLAS_INLINE
+    void
+    prod (const matrix_expression<E1> &e1,
+          const matrix_expression<E2> &e2,
+          M &m, B b,
+         unknown_storage_tag,
+          unknown_orientation_tag) {
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::expression_type expression_type;
+
+        m.assign(expression_type (e1 (), e2 ()));
+    }
+
+    template<class M, class E1, class E2, typename B>
+    BOOST_UBLAS_INLINE
+    void
+    prod (const matrix_expression<E1> &e1,
+          const matrix_expression<E2> &e2,
+          M &m, B b,
+         dense_proxy_tag,
+          unknown_orientation_tag) {
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::expression_type expression_type;
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::result_type result_type;
+       typedef typename result_type::value_type result_value;
+
+       if (e1 ().size1() < B::limit || e2 ().size2() < B::limit) {
+            m.assign(expression_type (e1 (), e2 ()));
+       } else {
+           detail::gemm(result_value(1), e1, e2,
+                        result_value(0), m, b);
+       }
+    }
+
+    // dispatcher
+      template<class M, class E1, class E2, class B>
     BOOST_UBLAS_INLINE
     M &
     prod (const matrix_expression<E1> &e1,
           const matrix_expression<E2> &e2,
+          M &m, B b) {
+        BOOST_STATIC_ASSERT (E1::complexity == 0 && E2::complexity == 0);
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::storage_category storage_category;
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::orientation_category orientation_category;
+
+        prod (e1, e2, m, b, storage_category(), orientation_category());
+       return m;
+    }
+    template<class M, class E1, class E2>
+    BOOST_UBLAS_INLINE
+    typename enable_if_c<!detail::is_blocksize<M>::value,
+                        M&>::type
+    prod (const matrix_expression<E1> &e1,
+          const matrix_expression<E2> &e2,
           M &m) {
-        return m.assign (prod (e1, e2));
+        BOOST_STATIC_ASSERT (E1::complexity == 0 && E2::complexity == 0);
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::storage_category storage_category;
+        typedef typename matrix_matrix_binary_traits<typename E1::value_type, E1,
+                                                     typename E2::value_type, E2>::orientation_category orientation_category;
+       typedef typename detail::prod_block_size<typename common_type<typename E1::value_type,
+                                                                     typename E2::value_type>::type> block_sizes;
+        prod (e1, e2, m, block_sizes(), storage_category(), orientation_category());
+       return m;
     }
 
     template<class M, class E1, class E2>
diff --git a/include/boost/numeric/ublas/operation.hpp b/include/boost/numeric/ublas/operation.hpp
index 64657cc..393a9bd 100644
--- a/include/boost/numeric/ublas/operation.hpp
+++ b/include/boost/numeric/ublas/operation.hpp
@@ -14,7 +14,7 @@
 #define _BOOST_UBLAS_OPERATION_
 
 #include <boost/numeric/ublas/matrix_proxy.hpp>
-
+#include <boost/numeric/ublas/detail/gemm.hpp>
 /** \file operation.hpp
  *  \brief This file contains some specialized products.
  */
@@ -637,7 +637,63 @@ namespace boost { namespace numeric { namespace ublas {
         return m;
     }
 
-    // Dispatcher
+    template<class M, class E1, class E2, class S, class B>
+    BOOST_UBLAS_INLINE
+    M&
+    axpy_prod (const matrix_expression<E1> &e1,
+               const matrix_expression<E2> &e2,
+               M &m, full,
+              S, bool init, B)
+    {
+        typedef typename M::value_type value_type;
+       typedef S storage_category;
+       typedef typename M::orientation_category orientation_category;
+
+
+       if (init)
+         m.assign (zero_matrix<value_type> (e1 ().size1 (), e2 ().size2 ()));
+       return axpy_prod (e1, e2, m, full (), storage_category (),
+                         orientation_category ());
+    }
+
+    template<class M, class E1, class E2, class B>
+    BOOST_UBLAS_INLINE
+    M&
+    axpy_prod (const matrix_expression<E1> &e1,
+               const matrix_expression<E2> &e2,
+               M &m, full,
+              dense_proxy_tag, bool init, B)
+    {
+        typedef typename M::value_type value_type;
+       typedef B block_sizes;
+        if (m.size1() < B::limit || m.size2() < B::limit) {
+            typedef typename M::storage_category storage_category;
+            typedef typename M::orientation_category orientation_category;
+
+            if (init)
+                m.assign (zero_matrix<value_type> (e1 ().size1 (), e2 ().size2 ()));
+            return axpy_prod (e1, e2, m, full (), storage_category (),
+                             orientation_category ());
+        } else {
+            detail::gemm(value_type(1), e1, e2,
+                        value_type(init? 0 : 1), m, block_sizes());
+            return m;
+        }
+    }
+
+    template<class M, class E1, class E2, class B>
+    BOOST_UBLAS_INLINE
+    M&
+    axpy_prod (const matrix_expression<E1> &e1,
+               const matrix_expression<E2> &e2,
+               M &m, full,
+              dense_tag, bool init, B b)
+    {
+       return axpy_prod (e1, e2, m, full (), dense_proxy_tag(),
+                         init, b);
+    }
+
+    // Dispatchers
     template<class M, class E1, class E2, class TRI>
     BOOST_UBLAS_INLINE
     M &
@@ -651,11 +707,16 @@ namespace boost { namespace numeric { namespace ublas {
 
         if (init)
             m.assign (zero_matrix<value_type> (e1 ().size1 (), e2 ().size2 ()));
-        return axpy_prod (e1, e2, m, triangular_restriction (), storage_category (), orientation_category ());
+
+        return axpy_prod(e1, e2, m, triangular_restriction (),
+                        storage_category (), orientation_category(),
+                        init);
     }
+
     template<class M, class E1, class E2, class TRI>
     BOOST_UBLAS_INLINE
-    M
+    typename enable_if_c<!detail::is_blocksize<TRI>::value,
+                               M>::type
     axpy_prod (const matrix_expression<E1> &e1,
                const matrix_expression<E2> &e2,
                TRI) {
@@ -690,29 +751,61 @@ namespace boost { namespace numeric { namespace ublas {
           \param E1 type of a matrix expression \c A
           \param E2 type of a matrix expression \c X
   */
+    template<class M, class E1, class E2, class B>
+    BOOST_UBLAS_INLINE
+    M &
+    axpy_prod (const matrix_expression<E1> &e1,
+               const matrix_expression<E2> &e2,
+               M &m, bool init, B) {
+        typedef typename M::storage_category storage_category;
+       typedef B block_sizes;
+
+        return axpy_prod(e1, e2, m, full (), storage_category (),
+                        init, block_sizes());
+    }
+
     template<class M, class E1, class E2>
     BOOST_UBLAS_INLINE
     M &
     axpy_prod (const matrix_expression<E1> &e1,
                const matrix_expression<E2> &e2,
                M &m, bool init = true) {
-        typedef typename M::value_type value_type;
         typedef typename M::storage_category storage_category;
-        typedef typename M::orientation_category orientation_category;
+       typedef typename detail::prod_block_size<typename common_type<typename E1::value_type,
+                                                                     typename E2::value_type>::type> block_sizes;
 
-        if (init)
-            m.assign (zero_matrix<value_type> (e1 ().size1 (), e2 ().size2 ()));
-        return axpy_prod (e1, e2, m, full (), storage_category (), orientation_category ());
+        return axpy_prod(e1, e2, m, full (), storage_category (),
+                        init, block_sizes());
+    }
+
+    template<class M, class E1, class E2, class B>
+    BOOST_UBLAS_INLINE
+    typename boost::enable_if_c<detail::is_blocksize<B>::value,
+                               M>::type
+    axpy_prod (const matrix_expression<E1> &e1,
+               const matrix_expression<E2> &e2, B) {
+        typedef M matrix_type;
+        typedef typename M::storage_category storage_category;
+       typedef B block_sizes;
+
+        matrix_type m (e1 ().size1 (), e2 ().size2 ());
+        return axpy_prod(e1, e2, m, full (), storage_category(), true,
+                        block_sizes());
     }
+
     template<class M, class E1, class E2>
     BOOST_UBLAS_INLINE
     M
     axpy_prod (const matrix_expression<E1> &e1,
                const matrix_expression<E2> &e2) {
         typedef M matrix_type;
+        typedef typename M::storage_category storage_category;
+       typedef typename detail::prod_block_size<typename common_type<typename E1::value_type,
+                                                                     typename E2::value_type>::type> block_sizes;
 
         matrix_type m (e1 ().size1 (), e2 ().size2 ());
-        return axpy_prod (e1, e2, m, full (), true);
+        return axpy_prod(e1, e2, m, full (), storage_category(), true,
+                        block_sizes());
     }
 
 
@@ -825,6 +918,47 @@ namespace boost { namespace numeric { namespace ublas {
         return opb_prod (e1, e2, m, true);
     }
 
+  /** \brief computes <tt>C = alpha * A * B + beta * C</tt> in an
+          optimized fashion.
+
+         \param alpha scalar multiplier
+          \param e1 the matrix expression \c A
+          \param e2 the matrix expression \c B
+         \param beta scalar multiplier
+          \param e3  the result matrix \c C
+
+         <tt>gemm</tt> implements the well known gemm operation
+
+          \ingroup blas3
+
+          \internal
+
+          template parameters:
+          \param E1 type of a matrix expression \c A
+          \param E2 type of a matrix expression \c B
+          \param E3 type of a matrix expression \c C
+  */
+    template <typename E1, typename E2, typename E3>
+    BOOST_UBLAS_INLINE
+    void
+    gemm(typename E3::value_type alpha, const matrix_expression<E1> &e1,
+         const matrix_expression<E2> &e2,
+        typename E3::value_type beta, matrix_expression<E3> &e3) {
+        typedef typename detail::prod_block_size<typename common_type<typename E1::value_type,
+                                                                     typename E2::value_type>::type> block_sizes;
+       detail::gemm(alpha, e1, e2, beta, e3, block_sizes());
+    }
+
+    template <typename E1, typename E2, typename E3, typename BlockSize>
+    BOOST_UBLAS_INLINE
+    void
+    gemm(typename E3::value_type alpha, const matrix_expression<E1> &e1,
+         const matrix_expression<E2> &e2,
+        typename E3::value_type beta, matrix_expression<E3> &e3,
+        BlockSize b) {
+       detail::gemm(alpha, e1, e2, beta, e3, b);
+    }
+
 }}}
 
 #endif
--
1.9.1
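
[Editor's note] For reference, a minimal usage sketch of the gemm() interface added above. This is only an illustration, assuming the patch series is applied and builds; the header name below is an assumption (operation.hpp, where axpy_prod and opb_prod already live, appears to be the file this patch extends), and the matrix sizes are arbitrary:

    #include <boost/numeric/ublas/matrix.hpp>
    #include <boost/numeric/ublas/operation.hpp>  // assumed header for axpy_prod/gemm
    #include <cstddef>

    int main() {
        namespace ublas = boost::numeric::ublas;
        ublas::matrix<double> A(100, 100), B(100, 100), C(100, 100);

        // Fill the operands with something simple.
        for (std::size_t i = 0; i < A.size1(); ++i)
            for (std::size_t j = 0; j < A.size2(); ++j) {
                A(i, j) = 1.0;
                B(i, j) = 2.0;
                C(i, j) = 0.0;
            }

        // C = 1.0 * A * B + 0.0 * C, using the blocked kernels with the
        // default block sizes picked by prod_block_size.
        ublas::gemm(1.0, A, B, 0.0, C);
        return 0;
    }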

[PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

palik imre
In reply to this post by palik imre
This patch increases the range of BLAS level 3 benchmarks for dense
matrices up to 1000*1000 matrices.

Signed-off-by: Imre Palik <[hidden email]>
---
 benchmarks/bench1/bench1.cpp  | 14 ++++++++++----
 benchmarks/bench1/bench13.cpp |  8 ++++++++
 benchmarks/bench3/bench3.cpp  | 14 ++++++++++----
 benchmarks/bench3/bench33.cpp |  8 ++++++++
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/benchmarks/bench1/bench1.cpp b/benchmarks/bench1/bench1.cpp
index 87478e1..300eafa 100644
--- a/benchmarks/bench1/bench1.cpp
+++ b/benchmarks/bench1/bench1.cpp
@@ -76,22 +76,28 @@ void do_bench (std::string type_string, int scale)
     header (type_string + ", 3");
     bench_1<scalar, 3> () (1000000 * scale);
     bench_2<scalar, 3> () (300000 * scale);
-    bench_3<scalar, 3> () (100000 * scale);
+    bench_3<scalar, 3> () (3000000 * scale);
 
     header (type_string + ", 10");
     bench_1<scalar, 10> () (300000 * scale);
     bench_2<scalar, 10> () (30000 * scale);
-    bench_3<scalar, 10> () (3000 * scale);
+    bench_3<scalar, 10> () (100000 * scale);
 
     header (type_string + ", 30");
     bench_1<scalar, 30> () (100000 * scale);
     bench_2<scalar, 30> () (3000 * scale);
-    bench_3<scalar, 30> () (100 * scale);
+    bench_3<scalar, 30> () (30000 * scale);
 
     header (type_string + ", 100");
     bench_1<scalar, 100> () (30000 * scale);
     bench_2<scalar, 100> () (300 * scale);
-    bench_3<scalar, 100> () (3 * scale);
+    bench_3<scalar, 100> () (1000 * scale);
+
+    header (type_string + ", 300");
+    bench_3<scalar, 300> () (30 * scale);
+
+    header (type_string + ", 1000");
+    bench_3<scalar, 1000> () (1 * scale);
 }
 
 int main (int argc, char *argv []) {
diff --git a/benchmarks/bench1/bench13.cpp b/benchmarks/bench1/bench13.cpp
index fadb0b6..2378d46 100644
--- a/benchmarks/bench1/bench13.cpp
+++ b/benchmarks/bench1/bench13.cpp
@@ -166,6 +166,8 @@ template struct bench_3<float, 3>;
 template struct bench_3<float, 10>;
 template struct bench_3<float, 30>;
 template struct bench_3<float, 100>;
+template struct bench_3<float, 300>;
+template struct bench_3<float, 1000>;
 #endif
 
 #ifdef USE_DOUBLE
@@ -173,6 +175,8 @@ template struct bench_3<double, 3>;
 template struct bench_3<double, 10>;
 template struct bench_3<double, 30>;
 template struct bench_3<double, 100>;
+template struct bench_3<double, 300>;
+template struct bench_3<double, 1000>;
 #endif
 
 #ifdef USE_STD_COMPLEX
@@ -181,6 +185,8 @@ template struct bench_3<std::complex<float>, 3>;
 template struct bench_3<std::complex<float>, 10>;
 template struct bench_3<std::complex<float>, 30>;
 template struct bench_3<std::complex<float>, 100>;
+template struct bench_3<std::complex<float>, 300>;
+template struct bench_3<std::complex<float>, 1000>;
 #endif
 
 #ifdef USE_DOUBLE
@@ -188,5 +194,7 @@ template struct bench_3<std::complex<double>, 3>;
 template struct bench_3<std::complex<double>, 10>;
 template struct bench_3<std::complex<double>, 30>;
 template struct bench_3<std::complex<double>, 100>;
+template struct bench_3<std::complex<double>, 300>;
+template struct bench_3<std::complex<double>, 1000>;
 #endif
 #endif
diff --git a/benchmarks/bench3/bench3.cpp b/benchmarks/bench3/bench3.cpp
index 390d226..72a3db5 100644
--- a/benchmarks/bench3/bench3.cpp
+++ b/benchmarks/bench3/bench3.cpp
@@ -76,22 +76,28 @@ void do_bench (std::string type_string, int scale)
     header (type_string + ", 3");
     bench_1<scalar, 3> () (1000000 * scale);
     bench_2<scalar, 3> () (300000 * scale);
-    bench_3<scalar, 3> () (100000 * scale);
+    bench_3<scalar, 3> () (3000000 * scale);
 
     header (type_string + ", 10");
     bench_1<scalar, 10> () (300000 * scale);
     bench_2<scalar, 10> () (30000 * scale);
-    bench_3<scalar, 10> () (3000 * scale);
+    bench_3<scalar, 10> () (100000 * scale);
 
     header (type_string + ", 30");
     bench_1<scalar, 30> () (100000 * scale);
     bench_2<scalar, 30> () (3000 * scale);
-    bench_3<scalar, 30> () (100 * scale);
+    bench_3<scalar, 30> () (30000 * scale);
 
     header (type_string + ", 100");
     bench_1<scalar, 100> () (30000 * scale);
     bench_2<scalar, 100> () (300 * scale);
-    bench_3<scalar, 100> () (3 * scale);
+    bench_3<scalar, 100> () (1000 * scale);
+
+    header (type_string + ", 300");
+    bench_3<scalar, 300> () (30 * scale);
+
+    header (type_string + ", 1000");
+    bench_3<scalar, 1000> () (1 * scale);
 }
 
 int main (int argc, char *argv []) {
diff --git a/benchmarks/bench3/bench33.cpp b/benchmarks/bench3/bench33.cpp
index 9b8e107..7eeec07 100644
--- a/benchmarks/bench3/bench33.cpp
+++ b/benchmarks/bench3/bench33.cpp
@@ -172,6 +172,8 @@ template struct bench_3<float, 3>;
 template struct bench_3<float, 10>;
 template struct bench_3<float, 30>;
 template struct bench_3<float, 100>;
+template struct bench_3<float, 300>;
+template struct bench_3<float, 1000>;
 #endif
 
 #ifdef USE_DOUBLE
@@ -179,6 +181,8 @@ template struct bench_3<double, 3>;
 template struct bench_3<double, 10>;
 template struct bench_3<double, 30>;
 template struct bench_3<double, 100>;
+template struct bench_3<double, 300>;
+template struct bench_3<double, 1000>;
 #endif
 
 #ifdef USE_STD_COMPLEX
@@ -187,6 +191,8 @@ template struct bench_3<std::complex<float>, 3>;
 template struct bench_3<std::complex<float>, 10>;
 template struct bench_3<std::complex<float>, 30>;
 template struct bench_3<std::complex<float>, 100>;
+template struct bench_3<std::complex<float>, 300>;
+template struct bench_3<std::complex<float>, 1000>;
 #endif
 
 #ifdef USE_DOUBLE
@@ -194,5 +200,7 @@ template struct bench_3<std::complex<double>, 3>;
 template struct bench_3<std::complex<double>, 10>;
 template struct bench_3<std::complex<double>, 30>;
 template struct bench_3<std::complex<double>, 100>;
+template struct bench_3<std::complex<double>, 300>;
+template struct bench_3<std::complex<double>, 1000>;
 #endif
 #endif
--
1.9.1


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
Just checking,
I may have lost track of it, but I don't see any pull requests on any
branch on ublas/ublas. Do you want me to create one so it is easier for
you guys?

-Nasos



Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

palik imre
I asked about the development process before and haven't received any response ...

Does this mean that you use pull requests instead of posting patches to the mailing list?

If yes, then I'll create the GitHub branch next weekend.

Cheers,

Imre



Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
The development process is as follows:

1. Create your fork from https://github.com/uBLAS/ublas (NOT boostorg). In that fork you can create a branch to work on, or use an already existing branch. For the simd gemm I just created a branch, so you don't need to create one (https://github.com/uBLAS/ublas/tree/feature/ublas00004_simd_gemm). For quick changes we are just using the develop branch.

2. Set up your local environment as described in https://github.com/uBLAS/ublas/wiki. Please replace the submodule in step "3. Replace the default submodule with the ublas dev repository:" with your fork. Note that many of the instructions in the wiki refer to users that have write access to ublas/ublas; most of them still apply to your fork, which you will have write access to.

3. Work in your fork and branch according to the regular git process (check out the feature/ublas00004_simd_gemm branch and perform the usual:
git add ...
git commit -am "Change text ...."
git push
Please note that these need to be executed in your forked ublas repository.)
* Here I have put the most used git commands:
https://github.com/uBLAS/ublas/wiki/Basic-GIT-commands

4. When you are done working on your branch and want to incorporate it into ublas, create a pull request from the GitHub interface and we will take care of:
(a) merging into develop,
(b) creating a pull request to boostorg/develop so that the tests can be run (http://www.boost.org/development/tests/develop/developer/numeric-ublas.html),
(c) David accepting the pull request, and
(d) merging into master if all seems OK.

* Essentially we are trying to stick to:
http://nvie.com/posts/a-successful-git-branching-model/
with the added rule that pull requests to boostorg/ublas are made ONLY for the develop and master branches.

* If you want to test pull requests and the dev model, play with this branch:
https://github.com/uBLAS/ublas/tree/ublas_sandbox0002_github_playground
(or any playground branch)

* If more than one developer works on the same branch, it is essential to synchronize efforts so that there are no serious merge conflicts.

* Once you are comfortable with git and GitHub, you may deviate from the process a bit as you see fit.

* Other people may have suggestions or tips if they are doing something differently.

Please let me know if you have any issues.

-Nasos




Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

palik imre
In reply to this post by palik imre
Fork is here: https://github.com/imre-palik/ublas/tree/feature/ublas00004_simd_gemm

The pull request has been sent.





Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

palik imre
It just occurred to me that, based on the descriptor struct, it would be possible to choose between parallel and serial implementations of the kernels.

Would anybody be interested in having something like that in ublas?

Would an OpenMP parallel implementation be accepted into the library?

Thanks,

Imre
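
[Editor's note] To make the idea concrete, here is a small, self-contained sketch of choosing a serial or parallel kernel from a descriptor struct via tag dispatch. The tags, the descriptor layout, and the kernel bodies below are invented for illustration and are not part of the patch:

    #include <iostream>

    struct serial_tag   {};
    struct parallel_tag {};

    // A descriptor in the spirit of prod_block_size; here it only carries the
    // serial/parallel choice, but it could also carry the block sizes.
    template <class Parallel = serial_tag>
    struct prod_descriptor {
        typedef Parallel parallel_category;
    };

    template <class D>
    void run_kernel(const D &, serial_tag)   { std::cout << "serial kernel\n"; }

    template <class D>
    void run_kernel(const D &, parallel_tag) { std::cout << "parallel kernel\n"; }

    // The gemm driver would pick the kernel by tag dispatch at compile time.
    template <class D>
    void gemm_like(const D &d) {
        run_kernel(d, typename D::parallel_category());
    }

    int main() {
        gemm_like(prod_descriptor<>());              // serial by default
        gemm_like(prod_descriptor<parallel_tag>());  // opt in to a parallel kernel
        return 0;
    }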



Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
std::thread would also be really easy to implement using lambda functions. I would consider OpenMP acceptable as long as it is not the default. A couple of years ago I benchmarked code using both approaches, but std::threads were slower, I assume because of inefficient implementations on some platforms. I am unsure of the situation now, but it should be easy to find out.

A "#pragma omp"-style loop implemented with std::thread should look like this (I didn't test this code):

    // Needs (C++11): #include <cstddef>, <iostream>, <thread>, <vector>
    const std::size_t num_threads = 4;
    std::vector<std::thread> workers;
    workers.reserve(num_threads); // Maybe this can provide some speedup.
    std::vector<double> v(num_threads * 10);

    for (std::size_t i = 0; i != num_threads; i++)
    {
        workers.push_back(std::thread([i, &v]()
        {
            const std::size_t index = i * 10;       // this thread's slice of v
            std::cout << "thread " << i << std::endl;
            for (std::size_t j = 0; j != 10; j++)
            {
                v[index + j] = j;                   // std::vector uses operator[]
            }
        }));
    }

    for (auto &w : workers)
    {
        w.join();
    }

All the above can be abstracted to perform kernel operations.
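
[Editor's note] For comparison, a minimal sketch of the same loop with OpenMP. It is likewise untested here, is not part of the patch, and assumes compilation with OpenMP enabled (e.g. -fopenmp); without OpenMP the pragma is ignored and the loop simply runs serially:

    #include <cstddef>
    #include <vector>

    int main() {
        const int nthreads = 4;
        std::vector<double> v(nthreads * 10);

        // Each iteration of the outer loop fills its own 10-element slice of v,
        // so the iterations are independent and can run in parallel.
        #pragma omp parallel for num_threads(nthreads)
        for (int i = 0; i < nthreads; i++) {
            for (std::size_t j = 0; j != 10; j++)
                v[i * 10 + j] = static_cast<double>(j);
        }
        return 0;
    }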

- Nasos


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Riccardo Rossi
In reply to this post by palik imre
Hi,

just to give my two cents: some care should be taken so that there is no overhead for very small matrices (say 3x3).

cheers
Riccardo


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
That is correct; there are various options here, but none is optimal. Do you have any suggestions?

Note: we now have fixed_vector and fixed_matrix classes that can be used to make compile-time decisions with respect to size.
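
[Editor's note] A minimal, self-contained illustration of such a compile-time decision. This is not ublas code: in ublas the compile-time size would come from fixed_matrix, while here it is a plain template parameter, and the threshold of 16 is arbitrary:

    #include <cstddef>
    #include <iostream>

    // Primary template: sizes above the threshold would go to the blocked gemm.
    template <std::size_t N, bool Small = (N <= 16)>
    struct prod_kernel {
        static const char *name() { return "blocked gemm kernel"; }
    };

    // Partial specialization: small fixed sizes keep the simple legacy loop,
    // so a 3x3 product pays no blocking/packing overhead.
    template <std::size_t N>
    struct prod_kernel<N, true> {
        static const char *name() { return "legacy triple-loop kernel"; }
    };

    int main() {
        std::cout << prod_kernel<3>::name() << "\n";     // -> legacy kernel
        std::cout << prod_kernel<1000>::name() << "\n";  // -> blocked gemm kernel
        return 0;
    }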


-Nasos


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

palik imre
In reply to this post by Riccardo Rossi
Right now I do run-time switching between the legacy and the gemm-based implementation, depending on the matrix size.

This incurs a slight overhead because of the extra branch, but it is the best tradeoff I was able to come up with.

I'd rather avoid introducing a new function for multiplying matrices, as there are already five ways to do that in ublas, so it is already way more confusing than it should be.

I could change the branch to a compile-time one for matrix types where compile-time size information is available, but I am unsure how widely those matrix types are used. (I've never seen anybody use anything other than the vanilla matrix.) If you (or anybody else) are interested in such a feature, please let me know.
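
[Editor's note] A self-contained sketch of the run-time switch described above. The threshold and the two kernel functions are placeholders for the real implementations, not the actual ublas code:

    #include <cstddef>
    #include <iostream>

    static void legacy_prod()  { std::cout << "legacy kernel\n"; }        // simple loops
    static void blocked_gemm() { std::cout << "blocked gemm kernel\n"; }  // packed/blocked

    static void prod_dispatch(std::size_t m, std::size_t n, std::size_t k) {
        const std::size_t threshold = 32 * 32 * 32;  // assumed cut-over point
        if (m * n * k <= threshold)
            legacy_prod();    // small product: the extra branch is the only overhead
        else
            blocked_gemm();   // large product: blocking and packing pay off
    }

    int main() {
        prod_dispatch(3, 3, 3);        // -> legacy kernel
        prod_dispatch(500, 500, 500);  // -> blocked gemm kernel
        return 0;
    }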



Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
Can you run some benchmarks comparing the legacy implementation with the switching one for small matrix sizes (up to, let's say, 20x20)?

-Nasos
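
Something along these lines should do for a first look; this is a rough sketch rather than the bench1 harness, the repetition counts are arbitrary, and prod() simply exercises whichever implementation the headers provide:

// Quick-and-dirty timing of prod() on small square matrices (2x2 .. 20x20),
// intended only for before/after comparisons with identical build flags.
#include <chrono>
#include <cstddef>
#include <iostream>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

int main()
{
    for (std::size_t n = 2; n <= 20; ++n) {
        ublas::matrix<double> a = ublas::scalar_matrix<double>(n, n, 1.0);
        ublas::matrix<double> b = ublas::scalar_matrix<double>(n, n, 2.0);
        ublas::matrix<double> c(n, n);

        const std::size_t reps = 2000000 / (n * n);   // keep run times comparable
        const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        for (std::size_t r = 0; r < reps; ++r)
            ublas::noalias(c) = ublas::prod(a, b);
        const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        const double mflops = 2.0 * n * n * n * reps / seconds / 1e6;
        std::cout << n << "x" << n << ": " << mflops << " Mflops\n";
    }
    return 0;
}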


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Riccardo Rossi
In reply to this post by Nasos Iliopoulos

Well, the problem is that fixed matrices and vectors cannot be used if you don't know the size at compile time.

In my application (finite elements) I have millions of matrices of size around 10x10, on which I operate within outer OpenMP loops.

Having OpenMP inside the matrix operations would be a killer for me.

Riccardo
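
To make the usage pattern concrete, it looks roughly like the sketch below (the element sizes, the container layout, and the single D*B product are simplified stand-ins for the real assembly):

// Outer OpenMP loop over many small element matrices.  If prod() spawned its
// own threads, every iteration would oversubscribe the machine.
#include <cstddef>
#include <vector>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

// Assumes lhs[e] is already sized to D[e].size1() x B[e].size2().
void assemble(std::vector< ublas::matrix<double> > &lhs,
              const std::vector< ublas::matrix<double> > &D,
              const std::vector< ublas::matrix<double> > &B)
{
    // One ~10x10 product per element; the parallelism lives out here,
    // not inside the matrix product.
    #pragma omp parallel for
    for (std::ptrdiff_t e = 0; e < static_cast<std::ptrdiff_t>(lhs.size()); ++e)
        ublas::noalias(lhs[e]) = ublas::prod(D[e], B[e]);
}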

On 7 Mar 2016 18:13, "Nasos Iliopoulos" <[hidden email]> wrote:
That is correct,
there are various options here but none is optimal. Do you have any suggestions?

Note: we now have fixed_vector and fixed_matrix classes that can be used to make compile-time decisions with respect to size.


-Nasos

Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
I suppose a define switch to disable the parallelization altogether would be OK, right?

-N


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Riccardo Rossi

Well, ideally you do want the parallelization for the sparse linear algebra...

Right now we do it ourselves by low-level access to the CRS internals, but if you put OpenMP there it would be a good fit.

Cheers
Riccardo
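
For reference, the kind of low-level access meant here looks roughly like the following; this is a hedged sketch that assumes a fully assembled, zero-index-based, row-major compressed_matrix, and it is not a proposed ublas interface:

// OpenMP-parallel y = A*x over the raw CRS arrays of a ublas compressed_matrix.
#include <cstddef>
#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <boost/numeric/ublas/vector.hpp>

namespace ublas = boost::numeric::ublas;

void csr_mv(const ublas::compressed_matrix<double> &A,
            const ublas::vector<double> &x,
            ublas::vector<double> &y)
{
    // With A completely assembled, index1_data() holds size1()+1 valid row
    // pointers, and index2_data()/value_data() hold the column indices and
    // values of the stored entries.
    const auto &ptr = A.index1_data();
    const auto &col = A.index2_data();
    const auto &val = A.value_data();

    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(A.size1()); ++i) {
        double sum = 0.0;
        for (std::size_t k = ptr[i]; k < ptr[i + 1]; ++k)
            sum += val[k] * x(col[k]);
        y(i) = sum;
    }
}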

Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

palik imre
In reply to this post by Riccardo Rossi
I don't plan to enable OpenMP by default, just to provide some infrastructure for the user to explicitly enable it.

As for your case: for 10x10 matrices it isn't even worth using gemm.  Would bounded matrices work for you?  If the bounds are below the break-even point for gemm (currently set to 27 for double), I could arrange for a compile-time switch.
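
If that sounds useful, the compile-time side could be as small as a trait on the static bounds.  This is an illustrative sketch: the trait name is made up, and 27 is the double-precision break-even value quoted above.

// Compile-time "is this below the gemm break-even point?" test for bounded_matrix.
#include <cstddef>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

static const std::size_t gemm_breakeven = 27;

// Generic case: the size is unknown at compile time, so the run-time branch stays.
template <class M>
struct below_breakeven { static const bool value = false; };

// bounded_matrix carries its bounds as template parameters.
template <class T, std::size_t M, std::size_t N, class L>
struct below_breakeven< ublas::bounded_matrix<T, M, N, L> > {
    static const bool value = (M < gemm_breakeven) && (N < gemm_breakeven);
};

int main()
{
    const bool small_fixed = below_breakeven< ublas::bounded_matrix<double, 10, 10> >::value;
    const bool dynamic     = below_breakeven< ublas::matrix<double> >::value;
    return (small_fixed && !dynamic) ? 0 : 1;    // true / false
}

The product implementation could then branch on below_breakeven<M>::value, which the compiler folds to a constant, so the run-time if disappears for bounded types.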


Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Riccardo Rossi

Nope, I use bounded_matrix when I know the size at compile time.

But most often I only know it at runtime.

Simple implementations are the best... the important thing is not to have ifs at runtime.

Can't you create a LargeMatrix class where you do the complicated things and leave the matrix class nice and simple?

Cheers
Riccardo
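
Just to make the suggestion concrete, presumably something like the sketch below is meant.  It is entirely hypothetical: no such class exists in ublas, and the blocked kernel is stubbed out with prod().

// A thin wrapper type whose product always takes the "complicated" path,
// leaving plain matrix and its operations untouched.
#include <cstddef>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

template <class T>
class large_matrix : public ublas::matrix<T> {
public:
    large_matrix(std::size_t m, std::size_t n) : ublas::matrix<T>(m, n) {}
};

// Unconditionally use the blocked kernel for large_matrix operands.
template <class T>
large_matrix<T> gemm(const large_matrix<T> &a, const large_matrix<T> &b)
{
    large_matrix<T> c(a.size1(), b.size2());
    ublas::noalias(c) = ublas::prod(a, b);       // placeholder for the real kernel
    return c;
}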

Re: [PATCH 3/3] boost::ublas increasing the range of BLAS level 3 benchmarks

Nasos Iliopoulos
I think the matrix abstraction is fine the way it is now.  It would be confusing to have a LargeMatrix class.

My current thinking is that we should have a two-tier switch: one that detects whether OpenMP is enabled, and one that enables parallelization based on user preference:

#if defined(_OPENMP) && defined(BOOST_UBLAS_PARALLEL)
// parallel code (with runtime switch if needed) goes here
#else
#ifdef BOOST_UBLAS_PARALLEL
#warning "OpenMP not present. boost::ublas parallel mode not enabled."
#endif
// serial code goes here
#endif

To enable parallel mode:
gcc myfile.cpp -o myexe -fopenmp -DBOOST_UBLAS_PARALLEL

The following does not enable ublas parallel mode but lets the user's own OpenMP code run:
gcc myfile.cpp -o myexe -fopenmp

This will not enable any parallelization at all:
gcc myfile.cpp -o myexe


Essentially, _OPENMP is defined when you pass the -fopenmp flag to gcc, and I suppose the same holds for all other compilers that support the standard.

* One downside of this approach is that temporarily disabling ublas parallel mode would need some hocus-pocus.

* I think this approach is better than nothing; if you can think of a clearer and/or more efficient way, please voice it.

* I would favor the std::thread approach, but thinking about it again I believe we would need to introduce state so that we have a facility for defining the number of threads.  We could use getenv (http://en.cppreference.com/w/cpp/utility/program/getenv), but that wouldn't allow for changes after execution has started.  On the other hand, OpenMP has state, and the user can use it deliberately.
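
For what it's worth, the state in question could be as small as the sketch below.  It is hypothetical: neither the namespace nor the BOOST_UBLAS_NUM_THREADS variable exists in ublas; it only shows a settable thread count with a getenv-based default.

// Minimal mutable "number of threads" setting with an environment-variable
// default; roughly the state OpenMP already provides via omp_set_num_threads().
#include <atomic>
#include <cstdlib>
#include <iostream>
#include <thread>

namespace ublas_parallel {

inline std::atomic<unsigned> &thread_count_ref()
{
    static std::atomic<unsigned> n(0);           // 0 means "not set yet"
    return n;
}

// User-facing setter, usable at any point during execution (unlike getenv alone).
inline void set_num_threads(unsigned n) { thread_count_ref() = n; }

inline unsigned num_threads()
{
    const unsigned n = thread_count_ref();
    if (n != 0)
        return n;                                // an explicit run-time setting wins
    if (const char *env = std::getenv("BOOST_UBLAS_NUM_THREADS"))   // hypothetical variable
        return static_cast<unsigned>(std::atoi(env));
    const unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 1;                     // hardware_concurrency() may return 0
}

} // namespace ublas_parallel

int main()
{
    std::cout << "default: " << ublas_parallel::num_threads() << "\n";
    ublas_parallel::set_num_threads(4);          // can be changed after start-up
    std::cout << "after set_num_threads(4): " << ublas_parallel::num_threads() << "\n";
    return 0;
}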

-Nasos

