Problems with yield_k workaround. Still too slow.

Problems with yield_k workaround. Still too slow.

Marcello Pietrobon
Hello Boost team,

As Gav Wood pointed out in http://boost.2283326.n4.nabble.com/Boost-Interprocess-Chronic-performance-on-Win7-td3755619.html Windows does not handle context switching between threads well.
Basically, if a thread asks to sleep for 1 ms, the OS puts it to sleep for an entire timeslice (~20 ms), even if the concurrent threads would need far less time to finish their work.

The effect is that two threads talking to each other through an emulated synchronization object can easily take 2 seconds for just 100 round trips.
As far as I can tell, the problem arises only when a synchronization object is emulated, which I understand is necessary, for example, in the Interprocess library.
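
To see the effect directly, here is a small self-contained sketch (separate from the tests below) that measures how long Sleep(1) actually blocks; the result depends on the current system timer resolution, so the numbers are only indicative.

#define NOMINMAX
#include <windows.h>
#include <chrono>
#include <iostream>

int main()
{
    using clock = std::chrono::steady_clock;
    for (int i = 0; i < 10; ++i)
    {
        clock::time_point t0 = clock::now();
        Sleep(1);                       // ask to sleep for 1 ms
        clock::time_point t1 = clock::now();
        // Typically prints ~15000-20000 us with the default timer resolution,
        // far more than the 1000 us requested.
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us" << std::endl;
    }
}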

It is not an easy problem to solve.

There is an ongoing discussion in the Interprocess library on how to fix this (see for example http://boost.2283326.n4.nabble.com/Boost-Interprocess-conditions-variables-get-10-times-faster-when-opening-a-multiprocess-browser-td4650763.html ), but clearly the problem is not limited to that library.

The yield_k workaround ( http://www.boost.org/doc/libs/1_54_0/boost/smart_ptr/detail/yield_k.hpp ) improves the situation, but only a little.

To isolate the problem I've run some tests using the Interprocess library, with the example comp_doc_anonymous_conditionA.cpp (and B).
This test is slowed down by a factor ranging from 2 to 10, 100, or even 1000 or more, depending on how soon that infamous Sleep(1) is triggered.

The bottom line is that in yield_k.hpp we need to change the constant in the line
else if( k < 32 ){
from 32 to at least 1000.
I call this constant kMax:
else if( k < kMax ){
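
For context, the Windows path of yield_k looks roughly like the sketch below (paraphrased from memory, with _mm_pause() standing in for Boost's pause macro; see the linked header for the real code). kMax marks the constant I propose to raise:

#include <windows.h>
#include <immintrin.h>

const unsigned kMax = 1000;          // proposed value; the shipped header uses 32

inline void yield(unsigned k)
{
    if (k < 4)
    {
        // first few attempts: plain busy spin
    }
    else if (k < 16)
    {
        _mm_pause();                 // the 'pause' case: spin-wait hint to the CPU
    }
    else if (k < kMax)               // originally: else if (k < 32)
    {
        ::Sleep(0);                  // give up the rest of the timeslice
    }
    else
    {
        ::Sleep(1);                  // this is where the ~20 ms penalty appears
    }
}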

Here are some test results, running XP SP3 on a Xeon E5450 (dual socket, quad core, 32-bit, 2 GHz core speed).

For the test I changed comp_doc_anonymous_conditionA.cpp to
const int NumMsg = 100000;
that is, 100000 iterations,

while in comp_doc_anonymous_conditionB.cpp I've added the line
for ( int j=0; j < jMax; ++j ) s += "1";
just to emulate a job done by the thread while it owns the mutex (jMax is varied in the results below).

That is:
      boost::posix_time::ptime now = microsec_clock::universal_time();
      std::string s;

      //Print messages until the other process marks the end
      bool end_loop = false;
      do{
         scoped_lock<interprocess_mutex> lock(data->mutex);
         if(!data->message_in){
            data->cond_empty.wait(lock);
         }

         // emulate some work done while holding the mutex (jMax varies per test)
         for ( int j=0; j < jMax; ++j ) s += "1";

         if(std::strcmp(data->items, "last message") == 0){
            end_loop = true;
         }
         else{
            //Print the message
  //          std::cout << data->items << std::endl;
            //Notify the other process that the buffer is empty
            data->message_in = false;
            data->cond_full.notify_one();
         }
      }
      while(!end_loop);

The results are (these numbers are representative of many tests I've run):
kMax = 32, jMax = 0 : time = 00:00:01.125000
kMax = 50, jMax = 0 : time = 00:00:00.403750
kMax = 100, jMax = 0 : time = 00:00:00.2803750
kMax = 65535, jMax = 0 : time = 00:00:00.203125
kMax ~= 64 is good

kMax = 32, jMax = 10 : time = 00:00:32.843750
kMax = 100, jMax = 10 : time = 00:00:02.750000
kMax = 1000, jMax = 10 : time = 00:00:00.859375
kMax = 65535, jMax = 10 : time = 00:00:00.328125
kMax ~= 1000 is good

kMax = 32, jMax = 100 : time = 00:04:01.953125
kMax = 50, jMax = 100 : time = 00:00:04.296875
kMax = 1000, jMax = 100 : time = 00:00:02.062500
kMax = 65535, jMax = 100 : time = 00:00:01.593750
kMax ~= 1000 is good

If the job takes longer we need to increase kMax even more, but it seems that a value of 65535 eliminates the problem and makes sure that a thread waiting on the lock doesn't take most of the CPU time for more than one timeslice.

Re: Problems with yield_k workaround. Still too slow.

Peter Dimov-2
Marcello Pietrobon wrote:
> The bottom line is that in yield_k.hpp we need to change the constant in
> the code
>
> from 32 to at least 1000.

I do not object to such a change (32 -> 1024 and perhaps 16 -> 64 or higher
for the 'pause' case as well); however, it would be prudent to first test it
against other usage scenarios. I can provide two test cases that would
likely be affected, sp_atomic_mt_test.cpp and sp_atomic_mt2_test.cpp, and
I'm sure that there are other Interprocess usage examples that can be
checked.

http://www.boost.org/doc/libs/1_54_0/libs/smart_ptr/test/sp_atomic_mt_test.cpp
http://www.boost.org/doc/libs/1_54_0/libs/smart_ptr/test/sp_atomic_mt2_test.cpp


Re: Problems with yield_k workaround. Still too slow.

Ion Gaztañaga
On 21/08/2013 10:41, Peter Dimov wrote:

> I do not object to such a change (32 -> 1024 and perhaps 16 -> 64 or
> higher for the 'pause' case as well); however, it would be prudent to
> first test it against other usage scenarios. I can provide two test
> cases that would likely be affected, sp_atomic_mt_test.cpp and
> sp_atomic_mt2_test.cpp, and I'm sure that there are other Interprocess
> usage examples that can be checked.
>
> http://www.boost.org/doc/libs/1_54_0/libs/smart_ptr/test/sp_atomic_mt_test.cpp
>
> http://www.boost.org/doc/libs/1_54_0/libs/smart_ptr/test/sp_atomic_mt2_test.cpp

As I wrote in the "[Boost.Interprocess] conditions variables get 10
times faster when opening a multiprocess browser" thread:

"It's definitely hard to tell if 1000 will be OK for everyone, as it
might depend on the CPU speed or waiter count". It might also depend on
the CPU count (on uniprocessor we should always yield)

Spinlock implementations found on the net are quite varied (a rough
sketch combining some of these ideas follows the links below):

-> some yield only once and then sleep, because they claim that's
better in high-contention cases
-> some just spin (optimized for very short waits)
-> some measure wait times to decide how much to spin

http://www.1024cores.net/home/lock-free-algorithms/tricks/spinning

http://src.chromium.org/svn/trunk/src/third_party/tcmalloc/chromium/src/base/spinlock.cc
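
For illustration only, something like the sketch below combines a few of these ideas (always yield on a uniprocessor, otherwise spin briefly, then yield, then sleep). The thresholds are arbitrary placeholders, not values taken from Boost.Interprocess or the links above.

#define NOMINMAX
#include <windows.h>
#include <immintrin.h>
#include <thread>

inline void backoff(unsigned k)
{
    // On a single-CPU machine spinning can never help: the other thread
    // cannot run until we give up the processor, so always yield there.
    static const bool single_cpu = std::thread::hardware_concurrency() <= 1;

    if (!single_cpu && k < 64)
    {
        _mm_pause();                 // short waits: spin with a PAUSE hint
    }
    else if (single_cpu || k < 1024)
    {
        ::Sleep(0);                  // yield the rest of the timeslice
    }
    else
    {
        ::Sleep(1);                  // long waits / heavy contention: really sleep
    }
}

// Typical use while waiting on a flag set by another thread or process:
//    for (unsigned k = 0; !done(); ++k) backoff(k);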

I'm afraid there is no single solution. In Marcello's example (a
condition variable) I think he needs to yield the current timeslice to
the sibling thread, as it needs a fast round trip between the waiter
and the notifier.

Best,

Ion

Re: Problems with yield_k workaround. Still too slow.

Marcello Pietrobon
The latest version by Ion, using his spin_wait class, completely solves the problem, at least in the cases I could test.

An interesting note: Microsoft has introduced a function that completely solves the Sleep(1) problem:
http://msdn.microsoft.com/en-us/library/system.threading.thread.spinwait.aspx?cs-save-lang=1&cs-lang=cpp#code-snippet-1

No comment on their decision to provide it only for their .NET library and, as it seems, not for C++.

Re: Problems with yield_k workaround. Still too slow.

Niall Douglas
On 7 Sep 2013 at 10:59, Marcello Pietrobon wrote:

> The latest version by Ion, using his spin_wait class, completely solves the
> problem, at least in the cases I could test.
>
> An interesting note: Microsoft has introduced a function that completely
> solves the Sleep(1) problem:
> http://msdn.microsoft.com/en-us/library/system.threading.thread.spinwait.aspx?cs-save-lang=1&cs-lang=cpp#code-snippet-1
>
> No comment on their decision to provide it only for their .NET library and,
> as it seems, not for C++.

It's very, very easy to roll your own in C++. Simply loop on the PAUSE
x86 instruction (PAUSE = REP NOP, i.e. a looped no-op). Hyperthreading
CPUs understand this as a hint to let the other hyperthread execute
instead.
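
Something like the sketch below is all it takes; _mm_pause() compiles to PAUSE / REP NOP on x86 with MSVC, GCC and Clang, and the little predicate wrapper is just my own framing, not anything from Boost:

#include <immintrin.h>

// Spin until the supplied predicate returns true, issuing PAUSE on each
// iteration so a hyperthreaded core can hand resources to its sibling.
template <class Pred>
void spin_until(Pred done)
{
    while (!done())
        _mm_pause();
}

// e.g.:  spin_until([&]{ return flag.load(std::memory_order_acquire); });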

Niall
--
Currently unemployed and looking for work.
Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
