Thought this was cool: This Is Why They Call It a Weakly-Ordered CPU



On this blog, I’ve been rambling on about lock-free programming subjects such as acquire and release semantics and weakly-ordered CPUs. I’ve tried to make these subjects approachable and understandable, but at the end of the day, talk is cheap! Nothing drives the point home better than a concrete example.

If there’s one thing that characterizes a weakly-ordered CPU, it’s that one CPU core can read values from shared memory in a different order than another core wrote them. That’s what I’d like to demonstrate in this post using pure C++11.

For normal applications, the x86/64 processor families from Intel and AMD do not have this characteristic. So we can forget about demonstrating this phenomenon on pretty much every modern desktop or notebook computer in the world. What we really need is a weakly-ordered multicore device. Fortunately, I happen to have one right here in my pocket:

The iPhone 4S fits the bill. It runs on a dual-core ARM-based processor, and the ARM architecture is, in fact, weakly-ordered.

Our experiment will consist of a single integer, sharedValue, protected by a mutex. We’ll spawn two threads, and each thread will run until it has incremented sharedValue 10000000 times.

We won’t let our threads block waiting on the mutex. Instead, each thread will loop repeatedly doing busy work (i.e., just wasting CPU time) and attempting to lock the mutex at random moments. If the lock succeeds, the thread will increment sharedValue, then unlock. If the lock fails, it will just go back to doing busy work. Here’s some pseudocode:

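count = 0
while count < 10000000:
    do a random amount of busy work
    if try-lock on the mutex succeeds:
        sharedValue++
        unlock the mutex
        count++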

With each thread running on a separate CPU core, the timeline should look something like this. Each red section represents a successful lock and increment, while the dark blue ticks represent lock attempts which failed because the other thread was already holding the mutex.

It bears repeating that a mutex is just a concept, and there are many ways to implement one. We could use the implementation provided by std::mutex, and of course, everything will function correctly. Instead, let’s implement a custom mutex — then let’s break it to demonstrate the consequences of weak hardware ordering. Intuitively, the potential for memory reordering will be highest at those moments when there is a “close shave” between threads — for example, at the moment circled in the above diagram, when one thread acquires the lock just as the other thread releases it.
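As a baseline, here is a minimal sketch of the same experiment using std::mutex and its non-blocking try_lock. It is an illustration only, not code from the sample application: the function name RunTrialWithStdMutex is made up, and std::this_thread::yield stands in for the random busy work.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical helper (not from the sample application): runs the experiment
// with std::mutex and returns the final value of the shared counter.
int RunTrialWithStdMutex(int incrementsPerThread)
{
    assert(incrementsPerThread >= 0);
    std::mutex mutex;
    int sharedValue = 0;

    auto worker = [&]() {
        int count = 0;
        while (count < incrementsPerThread) {
            // try_lock never blocks; it returns false if the lock was not taken
            if (mutex.try_lock()) {
                sharedValue++;
                mutex.unlock();
                count++;
            } else {
                std::this_thread::yield();  // stand-in for the random busy work
            }
        }
    };

    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return sharedValue;  // safe to read: both threads have been joined
}
```

Calling RunTrialWithStdMutex(n) should always return 2 * n, on any CPU, because std::mutex supplies the required ordering itself.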

The latest version of Xcode has terrific support for C++11 threads and atomic types, so let’s use those. All C++11 identifiers are defined in the std namespace, so let’s assume using namespace std; was placed somewhere earlier in the code.

A Ridiculously Simple Mutex

Our mutex will consist of a single integer flag, where 1 indicates that the mutex is held, and 0 means it isn’t. To ensure mutual exclusivity, a thread can only set flag to 1 if the previous value was 0, and it must do so atomically. To achieve this, we’ll define flag as a C++11 atomic type, atomic<int>, and use a read-modify-write operation:

int expected = 0;
if (flag.compare_exchange_strong(expected, 1, memory_order_acquire)) {
    // The lock succeeded
}

The memory_order_acquire argument used above is considered an ordering constraint. We’re placing acquire semantics on the operation, to help guarantee that we receive the latest shared values from the previous thread who held the lock.

To release the lock, we perform the following:

flag.store(0, memory_order_release);

This sets flag back to 0 using the memory_order_release ordering constraint, which applies release semantics. Acquire and release semantics must be used as a pair to ensure that shared values propagate completely from one thread to the next.
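Put together, the two operations could be wrapped up roughly as follows. This is a sketch under assumptions, not the sample application's code: the class name SimpleMutex and its method names are chosen here for illustration, and the assert in unlock is an extra sanity check.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper around the flag described in the text.
class SimpleMutex
{
public:
    // Returns true if the lock was acquired, false if it is already held.
    bool try_lock()
    {
        int expected = 0;
        // Atomically change flag from 0 to 1, with acquire semantics
        return flag.compare_exchange_strong(expected, 1, std::memory_order_acquire);
    }

    void unlock()
    {
        // Unlocking a mutex that is not held would be a bug in the caller
        assert(flag.load(std::memory_order_relaxed) == 1);
        // Set flag back to 0, with release semantics
        flag.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> flag{0};  // 1 = held, 0 = not held
};
```

A second try_lock on a held SimpleMutex simply returns false, which is exactly the behavior the experiment needs, since the threads never block.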

If We Don’t Use Acquire and Release Semantics…

Now, let’s write the experiment in C++11, but instead of specifying the correct ordering constraints, let’s put memory_order_relaxed in both places. This means no particular memory ordering will be enforced by the C++11 compiler, and any kind of reordering is permitted.

void IncrementSharedValue10000000Times(RandomDelay& randomDelay)
{
    int count = 0;
    while (count < 10000000) {
        randomDelay.doBusyWork();  // random busy work; method name assumed
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, memory_order_relaxed)) {
            // Lock was successful
            sharedValue++;
            flag.store(0, memory_order_relaxed);
            count++;
        }
    }
}

At this point, it’s informative to look at the resulting ARM assembly code generated by the compiler, in Release, using the Disassembly view in Xcode:

If you aren’t very familiar with assembly language, don’t worry. All we want to know is whether the compiler has reordered any operations on shared variables. This would include the two operations on flag, and the increment of sharedValue in between. Above, I’ve annotated the corresponding sections of assembly code. As you can see, we got lucky: The compiler chose not to reorder those operations, even though the memory_order_relaxed argument means that, in all fairness, it could have.

I’ve put together a sample application which repeats this experiment indefinitely, printing the final value of sharedValue at the end of each trial run. It’s available on GitHub if you’d like to view the source code or run it yourself.

Here’s the iPhone, hard at work, running the experiment:

And here’s the output from the Output panel in Xcode:

What’s going on? The final value of sharedValue is consistently less than 20000000, even though both threads perform exactly 10000000 increments, and the order of assembly instructions exactly matches the order of operations on shared variables as specified in C++.

You guessed it: This result is entirely due to memory reordering on the CPU. To point out just one possible reordering — and there are several — the memory interaction of str.w r0, [r11] (the store to sharedValue) could be reordered with that of str r5, [r6] (the store of 0 to flag). In other words, the mutex could be effectively unlocked before we’re finished with it! As a result, the next thread would be free to wipe out the change made by this one, resulting in a mismatched sharedValue count at the end of the experiment, just as we’re seeing here.

Using Acquire and Release Semantics Correctly

Fixing our sample application, of course, means putting the correct C++11 memory ordering constraints back in place:

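void IncrementSharedValue10000000Times(RandomDelay& randomDelay)
{
    int count = 0;
    while (count < 10000000) {
        randomDelay.doBusyWork();  // random busy work; method name assumed
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, memory_order_acquire)) {
            // Lock was successful
            sharedValue++;
            flag.store(0, memory_order_release);
            count++;
        }
    }
}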

As a result, the compiler now inserts a couple of dmb ish instructions, which act as memory barriers in the ARMv7 instruction set. I’m not an ARM expert — comments are welcome — but it’s safe to assume this instruction, much like lwsync on PowerPC, provides all the memory barrier types needed for acquire semantics on compare_exchange_strong, and release semantics on store.
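One way to picture where such barriers belong is to write them out by hand in C++11. The sketch below is an illustration only, not code from the sample application and not a claim about the exact instructions any compiler emits. It keeps memory_order_relaxed on the flag operations and adds explicit std::atomic_thread_fence calls, which C++11 defines to give the same acquire and release guarantees here. The names fenceFlag, TryLockWithFence and UnlockWithFence are made up for this example.

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> fenceFlag{0};  // hypothetical stand-in for flag: 1 = held, 0 = free

bool TryLockWithFence()
{
    int expected = 0;
    if (fenceFlag.compare_exchange_strong(expected, 1, std::memory_order_relaxed)) {
        // Acquire fence: later reads and writes cannot move above the successful lock
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}

void UnlockWithFence()
{
    assert(fenceFlag.load(std::memory_order_relaxed) == 1);  // must be held
    // Release fence: earlier reads and writes cannot move below the unlocking store
    std::atomic_thread_fence(std::memory_order_release);
    fenceFlag.store(0, std::memory_order_relaxed);
}
```

The position of each fence matters: the acquire fence comes after the operation that takes the lock, and the release fence comes before the store that gives it up.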

This time, our little home-grown mutex really does protect sharedValue, ensuring all modifications are passed safely from one thread to the next each time the mutex is locked.
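For anyone who wants to check this without the full sample application, a stripped-down version of the corrected experiment might look like the sketch below. It makes a few assumptions: the iteration count is a parameter, the random busy work is replaced by std::this_thread::yield, and the names IncrementSharedValue and RunTrial are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag{0};  // 1 = mutex held, 0 = mutex free
int sharedValue = 0;       // plain int, protected only by flag

// Hypothetical, parameterized variant of the function shown in the text.
void IncrementSharedValue(int times)
{
    int count = 0;
    while (count < times) {
        int expected = 0;
        if (flag.compare_exchange_strong(expected, 1, std::memory_order_acquire)) {
            // Lock was successful
            sharedValue++;
            flag.store(0, std::memory_order_release);
            count++;
        } else {
            std::this_thread::yield();  // stand-in for the random busy work
        }
    }
}

// Runs one trial with two threads and returns the final sharedValue.
int RunTrial(int incrementsPerThread)
{
    assert(flag.load() == 0);  // the mutex must start out unlocked
    sharedValue = 0;
    std::thread t1(IncrementSharedValue, incrementsPerThread);
    std::thread t2(IncrementSharedValue, incrementsPerThread);
    t1.join();
    t2.join();
    return sharedValue;
}
```

With acquire and release in place, RunTrial(n) should return exactly 2 * n on every run, on weakly-ordered and strongly-ordered CPUs alike.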

If you still don’t grasp intuitively what’s going on in this experiment, I’d suggest a review of my source control analogy post. In terms of that analogy, you can imagine two workstations each having local copies of sharedValue and flag, with some effort required to keep them in sync. Personally, I find visualizing it this way very helpful.

Interesting Notes

You can try running the sample application on any Windows, MacOS or Linux machine with a multicore x86/64 CPU, but unless the compiler performs reordering on specific instructions, you’ll never witness memory reordering at runtime. Indeed, when I tested it using Visual Studio 2012, no memory reordering occurred. That’s because x86/64 processors have what is usually considered a strong memory model.

This goes to show how easy it is to use C++11 atomics incorrectly without knowing it, simply because it appears to work correctly on a specific processor and toolchain.

Incidentally, Visual Studio 2012 generates rather poor x86 machine code for this sample. It’s nowhere near as efficient as the ARM code generated by Xcode. Meanwhile, performance is the main reason to use lock-free programming on multicore in the first place. It’s enough to turn me off using C++11 atomics on Windows for the time being.

This post is a followup to an earlier post where I demonstrated StoreLoad reordering on x86/64. In my experience, however, #StoreLoad does not come up quite as often in practice as the barrier types demonstrated here.

Finally, I’m not the first person to demonstrate weak hardware ordering in practice, though I might be the first to demonstrate it using C++11. There are earlier posts by Pierre Lebeaupin and ridiculousfish which use different experiments to demonstrate the same phenomenon.

from Hacker News 50:


Written by cwyalpha

October 19, 2012 at 4:05 PM

Posted in Uncategorized

