[multiprecision] Getting the best performance on windows for uint128_t etc


Boost - Users mailing list

Hi,

I develop a program for generating optimal addition chains. I normally use 64-bit integers, but in order to attack some conjectures I compile my program with Boost using uint128_t.

In order to get some functionality I need, I used the backend access to limbs. For example, I need population count / Hamming weight / binary digit count. I did it like this:

 

static
LPTYPER
bits(NTYPE n)
/*++
Calculates the number of 1 bits in a number.
--*/
{
    LPTYPER b = 0;
    std::size_t size = n.backend().size();
    boost::multiprecision::limb_type* p = n.backend().limbs();
    for (std::size_t i = 0; i < size; i++) {
        b += (LPTYPER)__popcnt64(p[i]);
    }
    return b;
}

 

I use the same sort of logic to calculate Floor(Log2(n)) (the index of the largest set bit) and to clear, say, the bottom b bits of an integer.

Is this a reasonable (though probably not ideal) way to go?
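A minimal sketch of the limb-walking approach for the bit-clearing case (clear_low_bits is a hypothetical helper for illustration, not the code from my program):

#include <boost/multiprecision/cpp_int.hpp>
#include <climits>   // CHAR_BIT
#include <cstddef>

// Illustrative sketch: clear the lowest b bits of n in place by walking
// the limbs, in the same spirit as bits() above.
static void clear_low_bits(boost::multiprecision::uint128_t& n, unsigned b)
{
    using limb_t = boost::multiprecision::limb_type;
    const unsigned limb_bits = sizeof(limb_t) * CHAR_BIT;

    limb_t* p = n.backend().limbs();
    std::size_t size = n.backend().size();

    for (std::size_t i = 0; i < size && b > 0; ++i) {
        if (b >= limb_bits) {
            p[i] = 0;                                  // whole limb is below the cut
            b -= limb_bits;
        } else {
            p[i] &= ~((limb_t(1) << b) - limb_t(1));   // partial limb
            b = 0;
        }
    }
    // If this zeroes the top limbs the backend is left un-normalised, so
    // size() may overstate the number of significant limbs afterwards.
}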

I noted that the limbs are 32 bits on Windows, which leads me to my second question: how can I get the internals of Boost to use bigger chunks, like 64 bits?

It seems I need a compiler supporting __int64. I have access to the Intel compiler (19) and MSVC, but neither seems to change this. I tried LLVM but that seemed quite slow; I may not have put enough effort into investigating LLVM up to this point.

 

Various bit operations seem overly slow in Boost. For example, I use sequences like this:

 

Control &= Control + 1;
Mask = ((Control - 1) ^ Control) >> 1;
Control |= Mask;

 

These seem to be bottlenecks under Boost. The sequence above is meant to generate a mask that straddles the first run of zero bits in Control, so 1010100000111111_2 produces a mask of 11111111111_2.
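As a quick check of that claim on a plain 64-bit value (a sketch; the point here is that the identical sequence on uint128_t is what runs slowly):

#include <cassert>
#include <cstdint>

int main()
{
    std::uint64_t Control = 0b1010100000111111ull;          // example value from above
    Control &= Control + 1;                                 // 1010100000000000
    std::uint64_t Mask = ((Control - 1) ^ Control) >> 1;    // 0000011111111111
    Control |= Mask;                                        // 1010111111111111
    assert(Mask == 0b11111111111ull);                       // 11 ones, as claimed
    return 0;
}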

Thanks.

Neill.

 


 



Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list
On Thu, 11 Jul 2019 at 22:15, Neill Clift via Boost-users <[hidden email]> wrote:

I use the same sort of logic to calculate Floor(Log2(n)) (the index of the largest set bit) and to clear, say, the bottom b bits of an integer.

Is this a reasonable (though probably not ideal) way to go?


If you're going to fiddle with the limbs anyway, why not just set the bits to zero directly?
 

I noted that the limbs are 32 bits on Windows and this leads me to my second question. How can I get the internals of boost using bigger chunks like 64 bit?


You'll have to compile the code for 64-bits (x64), then you'll get 64-bit limbs.

It seems I need a compiler supporting __int64.


You should use the standard types std::int64_t and std::uint64_t (#include <cstdint>).

I tried LLVM but that seemed quite slow. I may not have put enough effort into investigating LLVM up to this point.


Clang/LLVM (and GCC) have a builtin type, __uint128_t, as an extension (in 64-bit mode only); pass (-Xclang) -fforce-enable-int128.
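A quick sketch of what the extension gives you once it is enabled (assuming your clang-cl accepts the flag; g++/clang++ targeting 64-bit have it available by default):

#include <cstdint>

int main()
{
    // Full 64x64 -> 128-bit product via the compiler extension type.
    __uint128_t x = (__uint128_t)UINT64_MAX * UINT64_MAX;
    std::uint64_t hi = (std::uint64_t)(x >> 64);   // 0xFFFFFFFFFFFFFFFE
    std::uint64_t lo = (std::uint64_t)x;           // 0x0000000000000001
    return (int)(hi & lo);
}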

degski
--
@realdegski
"Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding


Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list

>If you're going to fiddle with the limbs anyway, why not just set the bits to zero directly?

 

I am unsure what you mean by this. When I'm clearing all of the bottom b set bits of a value I do assign them to zero directly; when I won't be clearing them all, I use the packed deposit (PDEP) instruction of x86 instead. That is a bit complicated, but it has a huge effect on performance for the 64-bit code.

 

>You'll have to compile the code for 64-bits (x64), then you'll get 64-bit limbs.

 

I am compiling for 64 bits and I get 32-bit limbs.

 

It seems I need a compiler supporting __int64.

 

> You should use the std-types std::int64_t and std::uint64_t (#include <cstdint>).

 

Sorry, I misspoke here. My reading suggests the compiler needs to support __int128 to get 64-bit limbs. I normally compile as 64-bit and use 64-bit unsigned ints.

 

 

>Clang/LLVM (and gcc) has a builtin type (in 64-bit mode only) as an extension __uint128_t (pass (-Xclang) -fforce-enable-int128).

 

I’ll give that a go, thanks.

Neill.




Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list

On 11/07/2019 18:04, Neill Clift via Boost-users wrote:

>
> Hi,
>
> I develop a program for generating optimal addition chains. I normally
> use 64 bit integers but in order to attack some conjectures I compile
> up my program with boost using uint128_t.
>
> In order to get some functionality I need I used the backend access to
> limbs. For example I need population count/hamming weight/binary digit
> count. I did this like this:
>
> static
> LPTYPER
> bits(NTYPE n)
> /*++
> Calculates the number of 1 bits in a number.
> --*/
> {
>     LPTYPER b = 0;
>     std::size_t size = n.backend().size();
>     boost::multiprecision::limb_type* p = n.backend().limbs();
>     for (std::size_t i = 0; i < size; i++) {
>         b += (LPTYPER)__popcnt64(p[i]);
>     }
>     return b;
> }
>
If you're writing platform/compiler specific code anyway, you could rely
on the fact that uint128_t will have an even number of limbs, and
reinterpret_cast the limb pointer to a pointer to uint64_t's and have
things still work - for performance reasons multiprecision does this
internally when initializing from a type twice the width of a limb. 
There are also the typedefs boost::multiprecision::detail::limb_type and boost::multiprecision::detail::double_limb_type, which could help here.
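A sketch of that idea, under the assumptions stated above (an even number of stored limbs for uint128_t, double_limb_type being twice the limb width; bits2 is just an illustrative name, and the exact namespace of the typedefs may differ between Boost versions):

#include <boost/multiprecision/cpp_int.hpp>
#include <intrin.h>   // __popcnt64 (MSVC)
#include <cstddef>

// Illustrative variant of bits() that consumes two 32-bit limbs at a time.
static unsigned bits2(boost::multiprecision::uint128_t n)
{
    using limb_t  = boost::multiprecision::limb_type;
    using dlimb_t = boost::multiprecision::double_limb_type;

    std::size_t size = n.backend().size();
    limb_t* p = n.backend().limbs();
    const dlimb_t* q = reinterpret_cast<const dlimb_t*>(p);

    unsigned b = 0;
    std::size_t i = 0;
    for (; i + 1 < size; i += 2)              // a pair of limbs as one 64-bit word
        b += (unsigned)__popcnt64(q[i / 2]);
    if (i < size)                             // odd number of *used* limbs
        b += (unsigned)__popcnt64(p[i]);
    return b;
}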

Aside: adding a native population count function might well be a useful
addition to the library...

> I use the same sort of logic to calculate Floor(Log2(n)) (number of
> largest bit set), to clear the bottom say b bits of an integer.
>
> Is this a reasonable (though probably not ideal) way to go?
>
> I noted that the limbs are 32 bits on Windows and this leads me to my
> second question. How can I get the internals of boost using bigger
> chunks like 64 bit?
>
The limb size is always half the width of the largest compiler supported
integer type - this is a necessary condition to implement arithmetic
entirely within the language.

So on GCC we have __int128 and the limb size is 64 bits, but on MSVC there's nothing wider than a 64-bit integer type, so the limbs are 32 bits.

If clang-win can have __int128 support enabled, then defining BOOST_HAS_INT128 will activate its use and you'll get 64-bit limbs in general, though in this specific case boost::uint128_t will become a thin wrapper around unsigned __int128 (i.e. it will have only one limb). Of course, if you have unsigned __int128 then you don't really need Boost.Multiprecision at all, unless it's to ensure portability of the program to platforms without that type.
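Something like this might be all that is needed (a sketch; whether the manual define is required at all depends on what Boost.Config already detects for your compiler):

// Compile with e.g.: clang-cl /O2 -Xclang -fforce-enable-int128 ...
#ifndef BOOST_HAS_INT128
#  define BOOST_HAS_INT128   // tell Boost the compiler has a 128-bit integer type
#endif
#include <boost/multiprecision/cpp_int.hpp>

// With 128-bit support enabled, uint128_t should end up as a thin wrapper
// over unsigned __int128 (one limb); wider cpp_int types get 64-bit limbs.
using u128 = boost::multiprecision::uint128_t;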


> It seems I need a compiler supporting __int64. I have access to the
> intel compiler (19) and MSVC but neither seem to change this. I tried
> LLVM but that seemed quite slow. I may not have put enough effort into
> investigating LLVM up to this point.
>
> Various bit operations seem overly slow in boost. For example I use
> sequences like this:
>
> Control &= Control + 1;
> Mask = ((Control - 1) ^ Control) >> 1;
> Control |= Mask;
>
> These seem to be bottlenecks under boost. That sequence above is meant
> to generate masks that straddle the first run of zero bits in control
> so 1010100000111111_2 produces a mask of 11111111111_2.
>
Well, I'm always open to suggestions for performance improvements, but there's not much low-hanging fruit in the bitwise operations, which are pretty simple on the whole.

BTW, I'm not sure whether they help here, but there are a few operations like lsb/msb which use compiler intrinsics on MSVC.
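For example (a sketch of how they might be used; both require a non-zero argument):

#include <boost/multiprecision/cpp_int.hpp>
#include <iostream>

int main()
{
    boost::multiprecision::uint128_t n = 1;
    n <<= 100;          // n = 2^100
    n += 12345;

    // msb/lsb map onto _BitScanReverse/_BitScanForward-style intrinsics
    // on MSVC where available.
    std::cout << msb(n) << '\n';   // 100, i.e. Floor(Log2(n))
    std::cout << lsb(n) << '\n';   // 0, index of the lowest set bit
    return 0;
}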

HTH, John.

> Thanks.
>
> Neill.
>


Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list
On Fri, 12 Jul 2019 at 10:38, John Maddock via Boost-users <[hidden email]> wrote:
Aside: adding a native population count function might well be a useful
addition to the library...

Native clz/ctz (_BitScanReverse/_BitScanForward) might be useful as well.
 
The limb size is always half the width of the largest compiler supported
integer type - this is a necessary condition to implement arithmetic
entirely within the language.

This seems a high price to pay [I guess the 'within the language' part is key here]. At some point in time [when 64-bit started to become the norm] there was some information about this on gmplib.org, which basically stated that 64-bit limbs [on x64] were 4 times faster than 32-bit limbs. GMP (MPIR) does 64-bit limbs on Windows, no problem [but yes, it's C].

If clang-win can have __int128 support enabled

clang-cl can/will have __int128 support enabled by passing it '-Xclang -fforce-enable-int128', i.e. pass '-fforce-enable-int128' to clang++.

degski
--
@realdegski
"Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding


Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list

On 12/07/2019 09:12, degski via Boost-users wrote:
> On Fri, 12 Jul 2019 at 10:38, John Maddock via Boost-users <[hidden email]> wrote:
>
>     Aside: adding a native population count function might well be a
>     useful
>     addition to the library...
>
>
> Native clz/ctz (_BitScanReverse/_BitScanForward) might be useful as well.

That's done as msb/lsb (most/least significant bit).

>     The limb size is always half the width of the largest compiler
>     supported
>     integer type - this is a necessary condition to implement arithmetic
>     entirely within the language.
>
>
> This seems a high price to pay [I guess the 'within the language' is
> key here]. At some point in time [when 64-bit started to be a general
> thing] there was some info regarding this on gmplib.org
> <http://gmplib.org>, which basically stated that 64-bit limbs [on x64]
> were 4 times faster than 32-bit limbs. GMP (MPIR) does 64-bit limbs on
> Windows, no problem [but yes it's C].

That's only possible with assembly level add/multiply/divide.

Ah OK, I just looked at MPIR on win64 with the mpir_gc backend, and they
do use 64-bit limbs, but rely on GNU's longlong.h for add/subtract/multiply,
which appears not to use inline assembly for MSVC, but instead breaks the
inputs into high/low parts and then does schoolbook multiplication. I would
expect that to be quite a bit slower than just using 32-bit limbs in the
first place, but no doubt you would make gains on the bitwise operations,
likewise if you're using an assembly-level backend.
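For illustration, a sketch of that high/low split: one 64x64 -> 128-bit multiply assembled from four 32x32 -> 64-bit partial products (a standalone example, not the longlong.h code itself):

#include <cstdint>

// Portable schoolbook 64x64 -> 128-bit multiply built from 32-bit halves.
static void mul_64x64_to_128(std::uint64_t a, std::uint64_t b,
                             std::uint64_t& hi, std::uint64_t& lo)
{
    const std::uint64_t a_lo = (std::uint32_t)a, a_hi = a >> 32;
    const std::uint64_t b_lo = (std::uint32_t)b, b_hi = b >> 32;

    const std::uint64_t p0 = a_lo * b_lo;   // contributes to bits   0..63
    const std::uint64_t p1 = a_lo * b_hi;   // contributes to bits  32..95
    const std::uint64_t p2 = a_hi * b_lo;   // contributes to bits  32..95
    const std::uint64_t p3 = a_hi * b_hi;   // contributes to bits  64..127

    // Carry out of the low 64 bits produced by the two middle products.
    const std::uint64_t carry =
        ((p0 >> 32) + (std::uint32_t)p1 + (std::uint32_t)p2) >> 32;

    lo = p0 + (p1 << 32) + (p2 << 32);
    hi = p3 + (p1 >> 32) + (p2 >> 32) + carry;
}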

Best, John.

>
>     If clang-win can have __int128 support enabled
>
>
> clang-cl can/will have __int128 support enabled by passing it '-Xclang
> -fforce-enable-int128', i.e. pass '-fforce-enable-int128' to clang++.
>
> degski
> --
> @realdegski
> https://edition.cnn.com/interactive/2019/06/middleeast/saudi-teen-death-penalty-intl/
> "Anyone who believes that exponential growth can go on forever in a
> finite world is either a madman or an economist" - Kenneth E. Boulding
>


Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list
On Fri, 12 Jul 2019 at 13:21, John Maddock via Boost-users <[hidden email]> wrote:
That's only possible with assembly level add/multiply/divide.

Right.

Ah OK, just looked at MPIR on win64 with the mpir_gc backend, and they
do use 64-bit limbs, but rely on GNU's longlong.h for
add/subtract/multiply which appears not to use inline assembly for MSVC,
but instead breaks the inputs into high/low parts and then does
schoolbook multiplication.

The mpir_gc ([g]eneric [c]) backend will by definition not use assembly; one needs to build something a bit closer to one's hardware (run the mpir_config.py script [in the msvc sub-folder] to configure, generate the solution files and build) to get any performance. It requires yasm as the assembler [installed and properly integrated into VS20XX]. Altogether it's neither obvious nor trivial to get something [a binary] that works well.

degski
--
@realdegski
"Anyone who believes that exponential growth can go on forever in a finite world is either a madman or an economist" - Kenneth E. Boulding


Re: [multiprecision] Getting the best performance on windows for uint128_t etc

Boost - Users mailing list

That's only possible with assembly level add/multiply/divide.


For the record, MSVC has compiler intrinsics for double-wide arithmetic, both signed and unsigned. For example:


https://docs.microsoft.com/en-us/cpp/intrinsics/mul128?view=vs-2019
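A minimal usage sketch of the unsigned counterpart, _umul128, which returns the low 64 bits of the product and writes the high 64 bits through a pointer (x64 only):

#include <intrin.h>
#include <cstdint>

int main()
{
    const std::uint64_t a = 0x123456789ABCDEF0ull;
    const std::uint64_t b = 0x0FEDCBA987654321ull;

    std::uint64_t hi = 0;
    const std::uint64_t lo = _umul128(a, b, &hi);   // hi:lo = full 128-bit product a * b

    return (int)((hi ^ lo) & 1);
}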


-- Stian


