[dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
Hi all, when I was playing around the Boost.DynamicBitset library, I
noticed the last lines of the count() function (which returns the number of
bits in the bitset that are set):

return do_count(m_bits.begin(), num_blocks(), Block(0),
    static_cast<value_to_type<(bool)mode> *>(0));

Well, this line was modified 10 years ago last time.

Let's go deeper. GNU C provides several language features not found in ISO
standard C. These include a bunch of bit operations - you can look at them
here: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html.
For example, *__builtin_popcountll *counts the number of bits in an
'unsigned long long', so we can use this kind of code to calculate bits
(don't judge, I wrote this fast and didn't think of coding standards):

size_type res = 0;
for (size_type i = 0; i < m_bits.size(); i++) {
    res += __builtin_popcountll(m_bits[i]);
}

I checked that m_bits[] and popcount() both work with 64-bit values on my
machine. It halved the running time of a program that has a bitset and
calls count() over and over again - here's the code
<https://gist.github.com/Izaron/5005fed4184ee6483699191dc5ef3677> I used.
These __builtin_XXX functions can be used everywhere where it's possible,
with preprocessor directives. Or is it actually used and longer time means
something else?
I used the latest version of Boost.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
On Wed, Aug 22, 2018 at 10:03 PM, Evgeny Shulgin via Boost
<[hidden email]> wrote:

> Hi all, when I was playing around the Boost.DynamicBitset library, I
> noticed the last lines of the count() function (which returns the number of
> bits in the bitset that are set):
>
> return do_count(m_bits.begin(), num_blocks(), Block(0),
>     static_cast<value_to_type<(bool)mode> *>(0));
>
> Well, this line was modified 10 years ago last time.
>
> Let's go deeper. GNU C provides several language features not found in ISO
> standard C. These include a bunch of bit operations - you can look at them
> here: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html.
> For example, *__builtin_popcountll *counts the number of bits in an
> 'unsigned long long', so we can use this kind of code to calculate bits
> (don't judge, I wrote this fast and didn't think of coding standards):
>
> size_type res = 0;
> for (size_type i = 0; i < m_bits.size(); i++) {
>     res += __builtin_popcountll(m_bits[i]);
> }
>
> I checked that m_bits[] and popcount() both work with 64-bit values on my
> machine. It halved the running time of a program that has a bitset and
> calls count() over and over again - here's the code
> <https://gist.github.com/Izaron/5005fed4184ee6483699191dc5ef3677> I used.
> These __builtin_XXX functions can be used everywhere where it's possible,
> with preprocessor directives. Or is it actually used and longer time means
> something else?
> I used the latest version of Boost.

I think such an optimization would be useful. Note that MSVC also has
intrinsics for popcount[1], although I don't think those are supported
when the target CPU doesn't implement the corresponding instructions.
You would have to check at compile time whether the target CPU
supports it (e.g. by checking if __AVX__ is defined).

I think, Boost.DynamicBitset is community-maintained. If you create a
PR, someone from CMT will take a look.

[1]: https://msdn.microsoft.com/en-us/library/bb385231.aspx

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
On 23/08/2018 09:16, Andrey Semashev wrote:
> I think such an optimization would be useful. Note that MSVC also has
> intrinsics for popcount[1], although I don't think those are supported
> when the target CPU doesn't implement the corresponding instructions.
> You would have to check at compile time whether the target CPU
> supports it (e.g. by checking if __AVX__ is defined).

While compile-time detection is better, if you can do it (because it
lets it be completely inlined); if the compile-time detection fails, you
can still do runtime detection, eg. by defining something like:

// header file
extern int (*popcnt64)(uint64_t);

// source file
static bool is_popcnt_supported()
{
     int info[4] = { 0 };
     __cpuid(info, 1);
     return (info[2] & 0x00800000) != 0;
}

static int popcnt64_intrinsic(uint64_t v)
{
     return /* _mm_popcnt_64(v) or __builtin_popcountll(v) */;
}

static int popcnt64_emulation(uint64_t v)
{
    // code that calculates it with bit twiddling
}

static int popcnt64_auto(uint64_t v)
{
     popcnt64 = is_popcnt_supported()
         ? &popcnt64_intrinsic
         : &popcnt64_emulation;
     return popcnt64(v);
}

int (*popcnt64)(uint64_t) = &popcnt64_auto;

Repeat for other argument sizes as needed.  You could probably do
something fancier with C++11 guaranteed static initialisation, but this
will work on all compilers.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Thu, 23 Aug 2018 at 00:17, Andrey Semashev via Boost <
[hidden email]> wrote:

> I think such an optimization would be useful. Note that MSVC also has
> intrinsics for popcount[1], although I don't think those are supported
> when the target CPU doesn't implement the corresponding instructions.
>

This instruction is supported on Intel Penryn (x86, 2007) and AMD 3rd Gen.
and onward from there.

You would have to check at compile time whether the target CPU
> supports it (e.g. by checking if __AVX__ is defined).
>

On a Penryn, __AVX__ will give a false negative. Checking  __cpuid as Gavin
writes should also work on that processor.

On ARM, which appears to have become an important cpu-type, there is the
corresponding VCNT instruction:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/CIHCFDBJ.html
, but this instruction does not seem to be provided as an intrinsic by
Microsoft.

degski
--
*“If something cannot go on forever, it will stop" - Herbert Stein*
*“No, it isn’t truth. Truth isn’t truth" - Rudolph W. L. Giuliani*

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On 08/23/18 05:08, Gavin Lambert via Boost wrote:

> On 23/08/2018 09:16, Andrey Semashev wrote:
>> I think such an optimization would be useful. Note that MSVC also has
>> intrinsics for popcount[1], although I don't think those are supported
>> when the target CPU doesn't implement the corresponding instructions.
>> You would have to check at compile time whether the target CPU
>> supports it (e.g. by checking if __AVX__ is defined).
>
> While compile-time detection is better, if you can do it (because it
> lets it be completely inlined); if the compile-time detection fails, you
> can still do runtime detection, eg. by defining something like:
>
> // header file
> extern int (*popcnt64)(uint64_t);
>
> // source file
> static bool is_popcnt_supported()
> {
>      int info[4] = { 0 };
>      __cpuid(info, 1);
>      return (info[2] & 0x00800000) != 0;
> }
>
> static int popcnt64_intrinsic(uint64_t v)
> {
>      return /* _mm_popcnt_64(v) or __builtin_popcountll(v) */;
> }
>
> static int popcnt64_emulation(uint64_t v)
> {
>     // code that calculates it with bit twiddling
> }
>
> static int popcnt64_auto(uint64_t v)
> {
>      popcnt64 = is_popcnt_supported()
>          ? &popcnt64_intrinsic
>          : &popcnt64_emulation;
>      return popcnt64(v);
> }
>
> int (*popcnt64)(uint64_t) = &popcnt64_auto;
>
> Repeat for other argument sizes as needed.  You could probably do
> something fancier with C++11 guaranteed static initialisation, but this
> will work on all compilers.

This code requires a separately compiled unit. It can be done in
header-only style, though.

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
But not on ARM or PowerPC says the embedded guy. Though both Gcc and clang support __builtin_popcount() across all processors, which is what we are using these days. I did notice one library does a call out openCl for popcount ()

I honestly was thinking about the same sort approach, though all compile time based. But I wasn’t sure what you would do with a signed value? As the builtin treats everything as unsigned.  
Do you mask out the sign bit or just cast it, or consider it an error to take a bit count of a signed value?

Sent from my iPhone

> On Aug 22, 2018, at 10:08 PM, Gavin Lambert via Boost <[hidden email]> wrote:
>
>> On 23/08/2018 09:16, Andrey Semashev wrote:
>> I think such an optimization would be useful. Note that MSVC also has
>> intrinsics for popcount[1], although I don't think those are supported
>> when the target CPU doesn't implement the corresponding instructions.
>> You would have to check at compile time whether the target CPU
>> supports it (e.g. by checking if __AVX__ is defined).
>
> While compile-time detection is better, if you can do it (because it lets it be completely inlined); if the compile-time detection fails, you can still do runtime detection, eg. by defining something like:
>
> // header file
> extern int (*popcnt64)(uint64_t);
>
> // source file
> static bool is_popcnt_supported()
> {
>    int info[4] = { 0 };
>    __cpuid(info, 1);
>    return (info[2] & 0x00800000) != 0;
> }
>
> static int popcnt64_intrinsic(uint64_t v)
> {
>    return /* _mm_popcnt_64(v) or __builtin_popcountll(v) */;
> }
>
> static int popcnt64_emulation(uint64_t v)
> {
>   // code that calculates it with bit twiddling
> }
>
> static int popcnt64_auto(uint64_t v)
> {
>    popcnt64 = is_popcnt_supported()
>        ? &popcnt64_intrinsic
>        : &popcnt64_emulation;
>    return popcnt64(v);
> }
>
> int (*popcnt64)(uint64_t) = &popcnt64_auto;
>
> Repeat for other argument sizes as needed.  You could probably do something fancier with C++11 guaranteed static initialisation, but this will work on all compilers.
>
> _______________________________________________
> Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
In reply to this post by Boost - Dev mailing list
On Wed, Aug 22, 2018 at 2:03 PM, Evgeny Shulgin via Boost <
[hidden email]> wrote:

> Hi all, when I was playing around the Boost.DynamicBitset library, I
> noticed the last lines of the count() function (which returns the number of
> bits in the bitset that are set):
>
> return do_count(m_bits.begin(), num_blocks(), Block(0),
>     static_cast<value_to_type<(bool)mode> *>(0));
>
> Well, this line was modified 10 years ago last time.
>
> Let's go deeper. GNU C provides several language features not found in ISO
> standard C. These include a bunch of bit operations - you can look at them
> here: https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html.
> For example, *__builtin_popcountll *counts the number of bits in an
> 'unsigned long long', so we can use this kind of code to calculate bits
> (don't judge, I wrote this fast and didn't think of coding standards):
>
> size_type res = 0;
> for (size_type i = 0; i < m_bits.size(); i++) {
>     res += __builtin_popcountll(m_bits[i]);
> }
>
> I checked that m_bits[] and popcount() both work with 64-bit values on my
> machine. It halved the running time of a program that has a bitset and
> calls count() over and over again - here's the code
> <https://gist.github.com/Izaron/5005fed4184ee6483699191dc5ef3677> I used.
> These __builtin_XXX functions can be used everywhere where it's possible,
> with preprocessor directives. Or is it actually used and longer time means
> something else?
>
>
By a curious coincidence, I have been working in that part of libc++ ;-)
You can take a look here:
    https://github.com/llvm-mirror/libcxx/blob/master/include/bit

Gcc and clang have provided the bit intrinsics for a long, long time -
since before libc++ cares about. Boost's mileage may vary.
The MSVC intrinsics are missing (for my purposes) important things - like
"noexcept" and "constexpr".

-- Marshall

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
On Sun, 26 Aug 2018 at 04:42, Marshall Clow via Boost <[hidden email]>
wrote:

> The MSVC intrinsics are missing (for my purposes) important things - like
> "noexcept" and "constexpr".
>

 They are C-"functions", they "are" noexcept, you can mark any C++-function
containing calls to those intrinsics noexcept, and you won't be lying. The
constexpr bit depends on what the C++-compiler does with those intrinsic
calls, with some exceptions, most intrinsics map to exactly one assembler
instruction, i.e. there is no reason why they could not be constexpr (but
that is a C++-concept).

degski
--
*“If something cannot go on forever, it will stop" - Herbert Stein*
*“No, it isn’t truth. Truth isn’t truth" - Rudolph W. L. Giuliani*

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Reply | Threaded
Open this post in threaded view
|

Re: [dynamic_bitset] Using of GCC built-in functions may increase performance of some methods

Boost - Dev mailing list
On Sat, Aug 25, 2018 at 11:55 PM, degski <[hidden email]> wrote:

> On Sun, 26 Aug 2018 at 04:42, Marshall Clow via Boost <
> [hidden email]> wrote:
>
>> The MSVC intrinsics are missing (for my purposes) important things - like
>> "noexcept" and "constexpr".
>>
>
>  They are C-"functions", they "are" noexcept, you can mark any
> C++-function containing calls to those intrinsics noexcept, and you won't
> be lying. The constexpr bit depends on what the C++-compiler does with
> those intrinsic calls, with some exceptions, most intrinsics map to exactly
> one assembler instruction, i.e. there is no reason why they could not be
> constexpr (but that is a C++-concept).
>
>
They certainly *could be*, but they are not.

BTW,  the mapping to one assembler instruction is irrelevant for constexpr;
rather the question is "does the compiler know how to evaluate that at
compile time".

-- Marshall

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost