Shared pointer atomic assembler for ARM?

Phil Endecott-48
Dear All,

I note that shared_ptr uses architecture-specific assembler for the
atomic operations needed for thread safe operations, on x86, ia64 and
ppc; it falls back to pthreads for other architectures.  Has anyone
quantified the performance benefit of the assembler?

Assuming that the benefit is significant, I'd like to implement it for
ARM.  Has anyone else looked at this?

ARM has a swap instruction.  I have a (very vague) recollection that
perhaps some of the newer chips have some other locked instructions
e.g. test-and-set, but I would want to code to the lowest common
denominator i.e. swap only.  Is this sufficient for what shared_ptr wants?

I note that since 4.1, gcc has provided built-in functions for atomic
operations.  But it says that "Not all operations are supported by all
target processors", and the list doesn't include swap; so maybe this
isn't so useful after all.  Another thought is that gcc now ships with
a tr1/shared_ptr which is based on shared_ptr from boost 1.32; maybe it
uses the gcc built-in atomics... checks... no, it calls something
called __exchange_and_add, the implementation of which in libstdc++
(disassembled) doesn't look atomic to me!

Anyway: Is this worthwhile?  Has it already been done?  Is swap sufficient?


Regards,

Phil.








Re: Shared pointer atomic assembler for ARM?

Peter Dimov-5
Phil Endecott:

> Dear All,
>
> I note that shared_ptr uses architecture-specific assembler for the
> atomic operations needed for thread safe operations, on x86, ia64 and
> ppc; it falls back to pthreads for other architectures.  Has anyone
> quantified the performance benefit of the assembler?
>
> Assuming that the benefit is significant, I'd like to implement it for
> ARM.  Has anyone else looked at this?
>
> ARM has a swap instruction.  I have a (very vague) recollection that
> perhaps some of the newer chips have some other locked instructions
> e.g. test-and-set, but I would want to code to the lowest common
> denominator i.e. swap only.  Is this sufficient for what shared_ptr wants?
>
> I note that since 4.1, gcc has provided built-in functions for atomic
> operations.  But it says that "Not all operations are supported by all
> target processors", and the list doesn't include swap; so maybe this
> isn't so useful after all.

Can you try the SVN trunk version of shared_ptr and look at the assembly?
detail/sp_counted_base.hpp should choose sp_counted_base_sync.hpp for g++
4.1 and higher and take advantage of the built-ins.

To answer your question: no, a mere swap instruction is not enough for
shared_ptr, it needs atomic increment, decrement and compare and swap.
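
For what it's worth, those three primitives map onto the g++ >= 4.1
builtins roughly as follows (a sketch for illustration only; the names
below are made up, and the real wrappers in sp_counted_base_sync.hpp may
differ):

inline long atomic_increment( long * p )
{
    return __sync_add_and_fetch( p, 1 );   // returns the new value
}

inline long atomic_decrement( long * p )
{
    return __sync_add_and_fetch( p, -1 );  // returns the new value
}

inline long atomic_compare_and_swap( long * p, long expected, long desired )
{
    // stores 'desired' only if *p still equals 'expected'; returns the old value
    return __sync_val_compare_and_swap( p, expected, desired );
}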

Re: Shared pointer atomic assembler for ARM?

Gottlob Frege
On 9/5/07, Peter Dimov <[hidden email]> wrote:
>
> To answer your question: no, a mere swap instruction is not enough for
> shared_ptr, it needs atomic increment, decrement and compare and swap.
>

Of course, incr and decr can be built from CAS.
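
For example (a sketch only, using the gcc builtin; not actual Boost code):

inline long increment_via_cas( long * p )
{
    for (;;)
    {
        long old = *p;
        // try to replace 'old' with 'old + 1'; this succeeds only if *p
        // still equals 'old', i.e. nobody else changed it in the meantime
        if ( __sync_bool_compare_and_swap( p, old, old + 1 ) )
            return old + 1;
        // another thread got in first; retry with the fresh value
    }
}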

Tony
Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
In reply to this post by Peter Dimov-5
Hi Peter,

Peter Dimov wrote:

> Phil Endecott:
>> I note that shared_ptr uses architecture-specific assembler for the
>> atomic operations needed for thread safe operations, on x86, ia64 and
>> ppc; it falls back to pthreads for other architectures.  Has anyone
>> quantified the performance benefit of the assembler?
>>
>> Assuming that the benefit is significant, I'd like to implement it for
>> ARM.  Has anyone else looked at this?
>>
>> ARM has a swap instruction.  I have a (very vague) recollection that
>> perhaps some of the newer chips have some other locked instructions
>> e.g. test-and-set, but I would want to code to the lowest common
>> denominator i.e. swap only.  Is this sufficient for what shared_ptr wants?
>>
>> I note that since 4.1, gcc has provided built-in functions for atomic
>> operations.  But it says that "Not all operations are supported by all
>> target processors", and the list doesn't include swap; so maybe this
>> isn't so useful after all.
>
> Can you try the SVN trunk version of shared_ptr and look at the assembly?
> detail/sp_counted_base.hpp should choose sp_counted_base_sync.hpp for g++
> 4.1 and higher and take advantage of the built-ins.
Well it's quicker for me to try this:

int x;

int main(int argc, char* argv[])
{
   __sync_fetch_and_add(&x,1);
}

$ arm-linux-gnu-g++ --version
arm-linux-gnu-g++ (GCC) 4.1.2 20061028 (prerelease) (Debian 4.1.1-19)

$ arm-linux-gnu-g++ -W -Wall check_sync_builtin.cc
check_sync_builtin.cc:3: warning: unused parameter ‘argc’
check_sync_builtin.cc:3: warning: unused parameter ‘argv’
/tmp/ccwWxfsT.o: In function `main':
check_sync_builtin.cc:(.text+0x20): undefined reference to `__sync_fetch_and_add_4'
collect2: ld returned 1 exit status

(It does compile on x86, and the disassembly includes a "lock addl" instruction.)

As I mentioned before, gcc doesn't implement these atomic builtins on
all platforms, i.e. it doesn't implement them on platforms where the
hardware doesn't provide them.  I don't fully understand how this all
works in libstdc++ (there are too many levels of #include and #if for
me to follow) but there seems to be a __gnu_cxx::__mutex that they can
use in those cases.

> To answer your question: no, a mere swap instruction is not enough for
> shared_ptr, it needs atomic increment, decrement and compare and swap.

Well, I think you can implement a spin-lock mutex with swap:

int mutex=0;  // 0 = unlocked, 1 = locked

void lock() {
   int n;
   do {
     n = 1;
     swap(mutex,n);  // atomic swap instruction: n receives the old value of mutex
   } while (n==1);   // if n is 1 after the swap, the mutex was already locked
}

void unlock() {
   mutex=0;
}


So you could use something like that to protect the reference counts,
rather than falling back to the pthread method.  Or alternatively, you
could use a sentinel value (say -1) in the reference count to indicate
that it's locked:

int refcount;

int read_refcount() {
   int r;
   do {
     r = refcount;
   } while (r==-1);    // spin while the count holds the "locked" sentinel
   return r;
}

int adj_refcount(int adj) {
   int r=-1;
   do {
     swap(refcount,r); // leaves -1 (locked) in refcount, the old count in r
   } while (r==-1);    // if the old value was -1, another thread held the lock
   refcount = r+adj;   // store the adjusted count, releasing the lock
   return r+adj;
}


(BTW, for gcc>=4.1 on x86 would you plan to use the gcc builtins or the
existing Boost asm?)

Regards,

Phil.





Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
Phil Endecott wrote:
> Peter Dimov wrote:
>> Phil Endecott:
>>> I note that shared_ptr uses architecture-specific assembler for the
>>> atomic operations needed for thread safe operations, on x86, ia64 and
>>> ppc; it falls back to pthreads for other architectures.  Has anyone
>>> quantified the performance benefit of the assembler?

(I'm still curious about this.)

>>> Assuming that the benefit is significant, I'd like to implement it for
>>> ARM.  Has anyone else looked at this?
>>>
>>> ARM has a swap instruction.  I have a (very vague) recollection that
>>> perhaps some of the newer chips have some other locked instructions
>>> e.g. test-and-set, but I would want to code to the lowest common
>>> denominator i.e. swap only.  Is this sufficient for what shared_ptr wants?

>> To answer your question: no, a mere swap instruction is not enough for
>> shared_ptr, it needs atomic increment, decrement and compare and swap.
>
> Well, I think you can implement a spin-lock mutex with swap:
>
> int mutex=0;  // 0 = unlocked, 1 = locked
>
> void lock() {
>    int n;
>    do {
>      n = 1;
>      swap(mutex,n);  // atomic swap instruction: n receives the old value of mutex
>    } while (n==1);   // if n is 1 after the swap, the mutex was already locked
> }
>
> void unlock() {
>    mutex=0;
> }
>
>
> So you could use something like that to protect the reference counts,
> rather than falling back to the pthread method.  Or alternatively, you
> could use a sentinel value (say -1) in the reference count to indicate
> that it's locked:
>
> int refcount;
>
> int read_refcount() {
>    int r;
>    do {
>      r = refcount;
>    } while (r==-1);    // spin while the count holds the "locked" sentinel
>    return r;
> }
>
> int adj_refcount(int adj) {
>    int r=-1;
>    do {
>      swap(refcount,r); // leaves -1 (locked) in refcount, the old count in r
>    } while (r==-1);    // if the old value was -1, another thread held the lock
>    refcount = r+adj;   // store the adjusted count, releasing the lock
>    return r+adj;
> }


Right, I have had a go at implementing this.  The following code seems
to generate plausible assembler but has not been tested.  Maybe someone
has a multi-threaded shared pointer test program that does zillions of
shared pointer accesses to see if they ever mess up?  But I would prefer
to think that this sort of code can be "proved correct" by the "many eyes"
method!  So let me know if you can see any faults.  In particular, I
have never used weak pointers, so I don't really know what that stuff
is doing.

This code can also be found at:

https://svn.chezphil.org/boost/trunk/include/boost/detail/

Regards,

Phil.



#ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_ARM_HPP_INCLUDED
#define BOOST_DETAIL_SP_COUNTED_BASE_GCC_ARM_HPP_INCLUDED

//
//  detail/sp_counted_base_gcc_arm.hpp
//
//  Copyright (c) 2001, 2002, 2003 Peter Dimov and Multi Media Ltd.
//  Copyright 2004-2005 Peter Dimov
//  Copyright 2007 Phil Endecott
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//
// ARM version using swap by Phil Endecott, based on sp_counted_base_nt.hpp
//

#include <typeinfo>

namespace boost
{

namespace detail
{

// All ARM processors since ARM3 (circa 1989) have an atomic swap instruction,
// but apart from a few very recent chips (the v6 architecture, not to be confused
// with the ARM6) this is the only atomic operation that they support.
//
// Implementing shared pointers using only swap requires a different approach
// from that used on other platforms: we store a sentinel value, -1, in the
// count variables to indicate that they are locked.  If an attempt to read
// a count gets -1, we spin until a non-negative value is returned.
//
// The following bit of assembler wraps the swap instruction.  It takes two
// register operands and one memory operand, i.e.
//      swp r1, r2, [r3]
// is equivalent to the atomic execution of
//      ldr r1, [r3]
//      str r2, [r3]


static inline long atomic_read_and_lock(long& mem)
{
  long val;
  do {
    // Equivalent to:
    //   val = mem;
    //   mem = -1;
    asm volatile ("swp\t%0, %1, [%2]"
                 :"=&r" (val)
                 :"r"  (-1),
                  "r"  (&mem)
                 :"memory");
  } while (val==-1);
  return val;
}

static inline long atomic_inc(long& mem, long inc)
{
  return mem = atomic_read_and_lock(mem)+inc;
}


class sp_counted_base
{
private:

    sp_counted_base( sp_counted_base const & );
    sp_counted_base & operator= ( sp_counted_base const & );

    long use_count_;        // #shared
    long weak_count_;       // #weak + (#shared != 0)

public:

    sp_counted_base(): use_count_( 1 ), weak_count_( 1 )
    {
    }

    virtual ~sp_counted_base() // nothrow
    {
    }

    // dispose() is called when use_count_ drops to zero, to release
    // the resources managed by *this.

    virtual void dispose() = 0; // nothrow

    // destroy() is called when weak_count_ drops to zero.

    virtual void destroy() // nothrow
    {
        delete this;
    }

    virtual void * get_deleter( std::type_info const & ti ) = 0;

    void add_ref_copy()
    {
        atomic_inc(use_count_,+1);
    }

    bool add_ref_lock() // true on success
    {
        long c = atomic_read_and_lock(use_count_);
        if (c==0) {
            use_count_ = 0;
            return false;
        }
        use_count_ = c+1;
        return true;
    }

    void release() // nothrow
    {
        if( atomic_inc(use_count_,-1) == 0 )
        {
            dispose();
            weak_release();
        }
    }

    void weak_add_ref() // nothrow
    {
        atomic_inc(weak_count_,+1);
    }

    void weak_release() // nothrow
    {
        if( atomic_inc(weak_count_,-1) == 0 )
        {
            destroy();
        }
    }

    long use_count() const // nothrow
    {
        while (1) {
            long c = use_count_;
            if (c>=0) return c;
        }
    }
};

} // namespace detail

} // namespace boost

#endif  // #ifndef BOOST_DETAIL_SP_COUNTED_BASE_GCC_ARM_HPP_INCLUDED


Re: Shared pointer atomic assembler for ARM?

Peter Dimov-5
In reply to this post by Phil Endecott-48
Phil Endecott:

> Well it's quicker for me to try this:
>
> int x;
>
> int main(int argc, char* argv[])
> {
>   __sync_fetch_and_add(&x,1);
> }
>
> $ arm-linux-gnu-g++ --version
> arm-linux-gnu-g++ (GCC) 4.1.2 20061028 (prerelease) (Debian 4.1.1-19)
>
> $ arm-linux-gnu-g++ -W -Wall check_sync_builtin.cc
> check_sync_builtin.cc:3: warning: unused parameter ‘argc’
> check_sync_builtin.cc:3: warning: unused parameter ‘argv’
> /tmp/ccwWxfsT.o: In function `main':
> check_sync_builtin.cc:(.text+0x20): undefined reference to
> `__sync_fetch_and_add_4'

Do __sync_lock_test_and_set and __sync_lock_release work?
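
(A check along the lines of the earlier test program might look like
this; sketch only:)

int x;

int main()
{
    int old = __sync_lock_test_and_set( &x, 1 );  // atomic exchange: returns the previous x
    __sync_lock_release( &x );                    // writes 0 back into x
    return old;
}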

Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
Hi Peter,

Peter Dimov wrote:
> Do __sync_lock_test_and_set and __sync_lock_release work?

__sync_lock_test_and_set doesn't work.
__sync_lock_release compiles, but I think it's a NO-OP.


My initial thought was, "Why is Peter asking me this?  I have already
said that it only has swap, so it's obviously not going to support
test-and-set!".  But I was sensible and checked the gcc docs before
posting, and it has this:

   type __sync_lock_test_and_set (type *ptr, type value, ...)
       This builtin, as described by Intel, is not a traditional test-and-set
       operation, but rather an atomic exchange operation.  It writes value
       into *ptr, and returns the previous contents of *ptr.

So "test and set" is actually an atomic swap, i.e. exactly what ARM has
- but misnamed.  Aarggh.  If it had the right name I would have noticed
it months ago.  (I note that gcc blame Intel for the misnaming.)

Yet gcc still doesn't seem to implement it.  I will go and ask on their
mailing list.

Regards,

Phil.




Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
Phil Endecott wrote:
> Peter Dimov wrote:
>> Do __sync_lock_test_and_set and __sync_lock_release work?
>
> __sync_lock_test_and_set doesn't work.

> So "test and set" is actually an atomic swap, i.e. exactly what ARM has

> Yet gcc still doesn't seem to implement it.  I will go and ask on their
> mailing list.

Here is what I have learnt from gcc-help:

- You can test for the presence of atomic-add and related builtins
using the __GCC_HAVE_SYNC_COMPARE_AND_SWAP_n predefined macros, so this
could be used to enable the sp_counted_base implementation that Peter
has done based on them (see the sketch after this list).  The catch is
that these macros are only defined in relatively recent versions of the
compiler, i.e. there are compilers that have the builtins but don't
advertise them with the macro.

- There is no ARM builtin for SWAP; I've filed a gcc bug suggesting it
(and a corresponding macro) as an enhancement.

- The code that I presented, which uses a sentinel value to lock the
count and spins if it finds it already locked, will have problems on a
system where a high-priority thread must explicitly yield to let lower
priority threads run.  This isn't the case on any system that I
(currently) care about, but it did make me wonder whether Boost has a
policy on supported threading models, and whether any of the other
Boost atomicity / threading code would suffer the same issue.
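
For the first point, a compile-time check might look something like the
following (a sketch only; the exact macro spelling and the fallback
header name are assumptions that would need checking against
boost/detail/sp_counted_base.hpp):

#if defined( __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 )
#  include <boost/detail/sp_counted_base_sync.hpp>  // uses the __sync builtins
#else
#  include <boost/detail/sp_counted_base_pt.hpp>    // pthreads fallback (name assumed)
#endif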


Regards,

Phil.




Re: Shared pointer atomic assembler for ARM?

Sid Sacek
In reply to this post by Phil Endecott-48

I don't like the spin-lock approach for the atomic variables.

If the mutex is already locked by another thread, spinning just wastes CPU
cycles on a single-processor machine until the time-slice for the current
thread comes to an end. That's usually a 10-millisecond waste of CPU in the
worst case scenario.

It might have some benefit on a multi-processor machine, but only if the
thread that currently owns the mutex lock has not already been pre-empted.

I don't see much benefit of the spin-lock approach over a system mutex,
since the system mutex at least allows other threads to continue running
while the mutex is being held.

A hybrid of a spin-lock and a mutex sounds better to me than a simple
spin-lock.  What I mean is that if the spin-lock does not terminate its
loop in short order, the lock loop is going to last for a very long time
(possibly up to 10 milliseconds).  If the loop does not exit quickly,
then yield the processor to some other thread (possibly even the thread
that's holding the lock in the first place).
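
Something along these lines, perhaps (an illustrative sketch only,
reusing Phil's swp assembler; the spin limit is arbitrary):

#include <sched.h>

static inline long read_and_lock_hybrid(long& mem)
{
  const int spin_limit = 100;             // arbitrary; tune for the target system
  for (;;) {
    for (int i = 0; i < spin_limit; ++i) {
      long val;
      asm volatile ("swp\t%0, %1, [%2]"   // val = mem; mem = -1 (locked)
                   :"=&r" (val)
                   :"r"  (-1),
                    "r"  (&mem)
                   :"memory");
      if (val != -1) return val;          // got the lock, and the old count
    }
    sched_yield();                        // give up the time-slice to the lock holder
  }
}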

There must be another 'better' method than the spin-lock approach that
Phil described.  I don't have any good answers right now myself, but I
hope that my concern was understood.

-Sid


Re: Shared pointer atomic assembler for ARM?

Andreas Pokorny
Hello,


2007/9/16, Sid Sacek <[hidden email]>:
> I don't see much benefit of the spin-lock approach over a system mutex,
> since the system mutex at least allows other threads to continue running
> while the mutex is being held.

In our company we use spin locks instead of system mutexes, because we
usually have close to one thousand objects that are used by different
threads.  Creating such a large number of mutexes is often not supported.


> A hybrid of a spin-lock and mutex sounds better to me than a simple
> spin-lock. What I mean is that if the spin-lock does not terminate its loop
> in short order, that means the lock loop is going last for a very long time
> ( possibly up to 10-milliseconds. ) If the loop does not exit quickly, then
> yield the processor to some other thread, (possibly even the thread that's
> holding the lock in the first place.)

Yes, we used that approach, and on systems like MIPS with LL/SC we
also loop a few times before giving up.  According to the documentation,
many false failures of the conditional store seem to be possible, hence
trying a few times seemed reasonable.  Giving up is usually done by sleeping.

Doesn't ARM support LL/SC-like operations?

Regards
Andreas Pokorny
Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
In reply to this post by Sid Sacek
Sid Sacek wrote:
> I don't like the spin-lock approach for the atomic variables.
>
> If the mutex is already locked by another thread, spinning just wastes CPU
> cycles on a single-processor machine until the time-slice for the current
> thread comes to an end. That's usually a 10-millisecond waste of CPU in the
> worst case scenario.

Hi Sid,

Yes, but because the variable is locked for just a handful of
instructions during the atomic increment/decrement, the probability of
being pre-empted is small.  So spinning is rare, and the average
slowdown is small.  It is more important to optimise the common case
where the variable is not locked.  In particular, spinning is much
better than using pthreads, which is what boost::shared_ptr does at
present on ARM Linux; the function call overhead alone will dwarf the
time spent spinning.  It also requires that each shared pointer is
larger since it must also store the mutex.

To quantify this, I've timed a simple single-threaded loop that does
100 million shared pointer increments and decrements.  With pthreads it
takes 180 seconds, while with my spinning implementation it takes 22
seconds.  That's about an additional microsecond per atomic operation.  
Comparing with the 10 ms that is wasted by spinning when an atomic
operation is interrupted, the break-even point is when one in every
10 000 atomic operations is interrupted by a thread that will access
the same variable.  On my hardware, the window during which the
variable stores the sentinel value is about 15 ns, so with the
scheduler running 100 times per second the probability of interrupting
an atomic operation is about one in 666 667.  That's a much lower
probability than one in 10 000, so the spinning solution is better.
[Do let me know if you can see any flaws in my logic.]
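
(For reference, a minimal loop of that sort might look like this; a
sketch, not the exact test program:)

#include <boost/shared_ptr.hpp>

int main()
{
    boost::shared_ptr<int> p( new int( 42 ) );
    for ( long i = 0; i < 100000000; ++i )
    {
        boost::shared_ptr<int> q( p );   // copy: atomic increment of the use count
    }                                    // q destroyed: atomic decrement
    return 0;
}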

An alternative is to simply yield when the variable is found to be locked:

#include <sched.h>

static inline long atomic_read_and_lock(long& mem)
{
   long val;
   do {
     // Equivalent to:
     //   val = mem;
     //   mem = -1;
     asm volatile ("swp\t%0, %1, [%2]"
                  :"=&r" (val)
                  :"r"  (-1),
                   "r"  (&mem)
                  :"memory");
     if (val != -1) {
       break;
     }
     sched_yield();
   } while (1);
   return val;
}

In the typical case this executes no more instructions than the spin
code, and my test program runs at the same speed.  But it does increase
the code size, which matters to me on my embedded system with not much
cache.  It is also OS-specific.

The benefits of each approach will depend on the workload, i.e. how
often shared pointers are actually used by more than one thread, and I
suggest benchmarking your application with each of the alternatives.


Regards,

Phil.




Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
In reply to this post by Andreas Pokorny
Andreas Pokorny wrote:
> Doesnt ARM support ll/sc like operations?

Not on the vast majority of currently-deployed processors; atomic swap
is all you get.  A few newer implementations do have some more
features, but they are not in widespread use yet.
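
(For the record, the v6 additions are the LDREX/STREX exclusive-access
instructions.  An atomic increment built on them might look roughly like
the following; an untested sketch that also ignores memory-barrier
considerations:)

static inline long atomic_increment_v6(long& mem)
{
  long val, tmp;
  asm volatile ("1: ldrex   %0, [%2]     \n\t"  // val = mem, marking exclusive access
                "   add     %0, %0, #1   \n\t"  // val += 1
                "   strex   %1, %0, [%2] \n\t"  // try to store; %1 is 0 on success
                "   cmp     %1, #0       \n\t"
                "   bne     1b           \n\t"  // another access intervened: retry
               :"=&r" (val), "=&r" (tmp)
               :"r"  (&mem)
               :"cc", "memory");
  return val;
}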


Phil.






Re: Shared pointer atomic assembler for ARM?

Sid Sacek
In reply to this post by Phil Endecott-48

Hi Phil,

I had a chance to think about this issue with spin-locks a bunch more.

Okay, while I agree with you that the possibility of a thread being
pre-empted in the middle of the assembly lock-routine is small, there is
still that possibility that it will happen, and depending on the
application, it might happen noticeably too often.

Knowing that, the programmer should be given the option whether to use ( or
not use ) the spin-locking mechanism in their code. I'm certain that most
applications will be fine using the locks in this manner, but there will be
cases where the possibility of wasting CPU cycles will be unacceptable. This
will all depend on the application and how often locking is performed in
that program.

I know that I personally would want the option of being able to turn that
feature on and off at my discretion, and I'm sure that others would too.

If you make it so that ARM coders are forced to always use the spin-locks,
they may then be forced into not using the boost::shared_ptrs and come up
with their own implementations.

If I had a say in this matter, I would request that the underlying locking
mechanism be configurable.

Regards,
-Sid Sacek


Re: Shared pointer atomic assembler for ARM?

Phil Endecott-48
Sid Sacek wrote:
> Hi Phil,
>
> I had a chance to think about this issue with spin-locks a bunch more.
>
> Okay, while I agree with you that the possibility of a thread being
> pre-empted in the middle of the assembly lock-routine is small, there is
> still that possibility that it will happen, and depending on the
> application, it might happen noticeably too often.

Some numbers would be nice :-)

> Knowing that, the programmer should be given the option whether to use ( or
> not use ) the spin-locking mechanism in their code. I'm certain that most
> applications will be fine using the locks in this manner, but there will be
> cases where the possibility of wasting CPU cycles will be unacceptable. This
> will all depend on the application and how often locking is performed in
> that program.

I tend to disagree concerning the atomics used for shared pointers, but
in the more general case of mutex implementations I do agree that
offering the user two or more implementations would be useful.  I
wonder how this would fit in with the recently-proposed std::threads
work?  This needs investigation.

> I know that I personally would want the option of being able to turn that
> feature on and off at my discretion, and I'm sure that others would too.

Note that sp_counted_base.hpp has some preprocessor stuff to select an
implementation, and I believe that you can influence this with suitable
#defines; so if you wanted to use pthreads mutexes (i.e. to get the
current behaviour) then you could get that.  The only proviso is that
all code compiled together would need to use the same (or compatible) implementations.
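
For example, something like this should select the pthreads-based count
(assuming the BOOST_SP_USE_PTHREADS configuration macro; check
sp_counted_base.hpp for the exact name and any related macros):

// Define before including any Boost smart pointer header, e.g. on the
// compiler command line or at the top of a common configuration header.
#define BOOST_SP_USE_PTHREADS
#include <boost/shared_ptr.hpp>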

> If you make it so that ARM coders are forced to always use the spin-locks,

Oh come off it.  I'm not in a position to "force" anyone to do
anything.  The chances of the code that I posted ever being included in
Boost are tiny - just look at how few replies there have been to this
thread.  And Boost doesn't have any ARM-based testing systems and ARM
isn't even a supported platform, so there would be justified objections
to ever including the code.  The only people who will ever use this
stuff are people who try the default Boost shared pointers, find that
they are very slow, find my implementation [either the original version
with pure spin-locks or the one I posted in the last message that
yields] and _choose_ to use it.

> they may then be forced into not using the boost::shared_ptrs and come up
> with their own implementations.

What does the code that I posted in my last email not do that you need?

> If I had a say in this matter, I would request that the underlying locking
> mechanism be configurable.

You do have a say; Boost is a democracy.

Thanks for your interest.

Regards,

Phil.



