Before moving beyond locks, we will first describe how to use locks in some common data
structures. Adding locks to a data structure to make it usable by threads makes the structure
thread safe. Of course, exactly how such locks are added determines both the correctness
and performance of the data structure. And thus, our challenge:
CRUX: How To Add Locks To Data Structures
When given a particular data structure, how should we add locks to it, in order to make it work
correctly? Further, how do we add locks such that the data structure yields high performance,
enabling many threads to access the structure at once, i.e., concurrently?
Of course, we will be hard pressed to cover all data structures or all methods for adding
concurrency, as this is a topic that has been studied for years, with literally thousands of
research papers published about it. Thus, we hope to provide a sufficient introduction to the
type of thinking required, and refer you to some good sources of material for further inquiry
on your own. We found Moir and Shavit's survey to be a great source of information.
Concurrent Counters
One of the simplest data structures is a counter. It is a structure that is commonly used and has
a simple interface. We define a simple non-concurrent counter in Figure 29.1.
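Since Figure 29.1 is not reproduced here, a minimal sketch of such a non-concurrent counter in C (our own rendering, not necessarily the figure's exact code) might look like this:

    typedef struct __counter_t {
        int value;              // the current count
    } counter_t;

    void init(counter_t *c)      { c->value = 0; }
    void increment(counter_t *c) { c->value++;   }
    void decrement(counter_t *c) { c->value--;   }
    int  get(counter_t *c)       { return c->value; }

Without any synchronization, simultaneous increments from two threads can be lost, because the increment is not atomic.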
Simple But Not Scalable
As you can see, the non-synchronized counter is a trivial data structure, requiring a tiny amount
of code to implement. We now have our next challenge: how can we make this code thread
safe? Figure 29.2 shows how we do so.
This concurrent counter is simple and works correctly. In fact, it follows a design pattern
common to the simplest and most basic concurrent data structures: it simply adds a single
lock, which is acquired when calling a routine that manipulates the data structure, and is
released when returning from the call. In this manner, it is similar to a data structure built
with monitors, where locks are acquired and released automatically as you call and return
from object methods.
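Figure 29.2 is likewise not shown here; the pattern it illustrates, a single pthread mutex acquired and released around every routine, can be sketched as follows (again, our own rendering):

    #include <pthread.h>

    typedef struct __counter_t {
        int             value;
        pthread_mutex_t lock;   // one lock protects the whole structure
    } counter_t;

    void init(counter_t *c) {
        c->value = 0;
        pthread_mutex_init(&c->lock, NULL);
    }

    void increment(counter_t *c) {
        pthread_mutex_lock(&c->lock);
        c->value++;
        pthread_mutex_unlock(&c->lock);
    }

    void decrement(counter_t *c) {
        pthread_mutex_lock(&c->lock);
        c->value--;
        pthread_mutex_unlock(&c->lock);
    }

    int get(counter_t *c) {
        pthread_mutex_lock(&c->lock);
        int rc = c->value;
        pthread_mutex_unlock(&c->lock);
        return rc;
    }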
At this point, you have a working concurrent data structure. The problem you might have is
performance. If your data structure is too slow, you will have to do more than just add a single
lock; such optimizations, if needed, are thus the topic of the rest of the chapter. Note that if the
data structure is not too slow, you are done! No need to do something fancy if something simple
will work.
To understand the performance costs of the simple approach, we run a benchmark in which
each thread updates a single shared counter a fixed number of times; we then vary the number
of threads. Figure 29.3 shows the total time taken, with one or four threads active; each
thread updates the counter one million times. This experiment was run upon an iMac with four
Intel 2.7GHz i5 CPUs; with more CPUs active, we hope to get more total work done per unit
time.
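A rough sketch of such a benchmark harness (our own construction; the book's exact measurement code is not shown) might look like this, reusing the locked counter sketched above:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_UPDATES 1000000   // one million updates per thread

    counter_t counter;            // the locked counter from the sketch above

    void *worker(void *arg) {
        for (int i = 0; i < NUM_UPDATES; i++)
            increment(&counter);
        return NULL;
    }

    int main(int argc, char *argv[]) {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
        pthread_t threads[nthreads];
        init(&counter);
        // time the region below (e.g., with gettimeofday()) to observe scaling
        for (int i = 0; i < nthreads; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(threads[i], NULL);
        printf("final count: %d\n", get(&counter));
        return 0;
    }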
From the top line in the figure (labeled precise), you can see that the performance of the
synchronized counter scales poorly. Whereas a single thread can complete the million counter
updates in a tiny amount of time (roughly 0.03 seconds), having two threads each update
the counter one million times concurrently leads to a massive slowdown (taking over 5
seconds!). It only gets worse with more threads.
Ideally, you would like to see the threads complete just as quickly on multiple processors as the
single thread does on one. Achieving this end is called perfect scaling; even though more work
is done, it is done in parallel, and hence the time taken to complete the task is not increased.
Scalable Counting
Amazingly, researchers have studied how to build more scalable counters for years. Even more
amazing is the fact that scalable counters matter, as recent work in operating system
performance analysis has shown; without scalable counting, some workloads running on Linux
suffer from serious scalability problems on multicore machines.
Though many techniques have been developed to attack this problem, we will now describe one
particular approach. The idea, introduced in recent research, is known as a sloppy counter.
The sloppy counter works by representing a single logical counter with numerous local physical
counters, one per CPU core, as well as a single global counter. Specifically, on a machine with
four CPUs, there are four local counters and one global one. In addition to these counters,
there are also locks: one for each local counter, and one for the global counter.
The basic idea of sloppy counting is as follows. When a thread running on a given core wishes
to increment the counter, it increments its local counter; access to this local counter is
synchronized via the corresponding local lock. Because each CPU has its own local counter,
threads across CPUs can update local counters without contention, and thus counter updates
are scalable.
However, to keep the global counter up to date (in case a thread wishes to read its value), the
local values are periodically transferred to the global counter, by acquiring the global lock and
incrementing it by the local counter's value; the local counter is then reset to zero.
How often this local-to-global transfer occurs is determined by a threshold, which we call S here
(for sloppiness). The smaller S is, the more the counter behaves like the non-scalable counter
above; the bigger S is, the more scalable the counter, but the further off the global value
might be from the actual count. One could simply acquire all the local locks and the global lock (
in a specified order, to avoid deadlock) to get an exact value, but that is not scalable.
To make this clear, let's look at an example. In this example, the threshold S is set to 5, and
there are threads on each of four CPUs updating their local counters L1, L2, L3, and L4. The
global counter value (G) is also shown in the trace, with time increasing downward. At each
time step, a local counter may be incremented; if the local value reaches the threshold S, the
local value is transferred to the global counter and the local counter is reset.
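One possible trace (the specific numbers here are our own illustration) looks like this; a local counter that reaches S = 5 transfers its value to G and resets to zero:

    Time    L1    L2    L3    L4     G
     0       0     0     0     0     0
     1       0     0     1     1     0
     2       1     0     2     1     0
     3       2     0     3     1     0
     4       3     0     3     2     0
     5       4     1     3     3     0
     6     5->0    1     3     4     5    (L1 hits S; transfer to G)
     7       0     2     4   5->0   10    (L4 hits S; transfer to G)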
The lower line in Figure 29.3 (labeled sloppy) shows the performance of sloppy
counters with a threshold S of 1024. Performance is excellent; the time taken to update the
counter four million times on four processors is hardly higher than the time taken to update it
one million times on one processor.
Figure 29.6 shows the importance of the threshold value S, with four threads each incrementing
the counter one million times on four CPUs. If S is low, performance is poor (but the global
count is always quite accurate); if S is high, performance is excellent, but the global count
lags (by at most the number of CPUs multiplied by S). This accuracy/performance trade-off is
what sloppy counters enable.
A rough version of such a sloppy counter is found in Figure 29.5. Read it, or better yet,
run it yourself in some experiments to better understand how it works.
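Since Figure 29.5 is not reproduced here, the following is a rough sketch of such a sloppy counter (our own rendering; NUMCPUS and the exact routine names are assumptions):

    #include <pthread.h>

    #define NUMCPUS 4   // assumed number of CPUs

    typedef struct __counter_t {
        int             global;          // global count
        pthread_mutex_t glock;           // global lock
        int             local[NUMCPUS];  // per-CPU counts
        pthread_mutex_t llock[NUMCPUS];  // ... and per-CPU locks
        int             threshold;       // update frequency (S)
    } counter_t;

    // init: record threshold, initialize locks, zero all counts
    void init(counter_t *c, int threshold) {
        c->threshold = threshold;
        c->global = 0;
        pthread_mutex_init(&c->glock, NULL);
        for (int i = 0; i < NUMCPUS; i++) {
            c->local[i] = 0;
            pthread_mutex_init(&c->llock[i], NULL);
        }
    }

    // update: usually just grab the local lock and bump the local count;
    // once the local count reaches 'threshold', grab the global lock,
    // transfer the local value to the global count, and reset the local count
    void update(counter_t *c, int threadID, int amt) {
        int cpu = threadID % NUMCPUS;
        pthread_mutex_lock(&c->llock[cpu]);
        c->local[cpu] += amt;                  // assumes amt > 0
        if (c->local[cpu] >= c->threshold) {
            pthread_mutex_lock(&c->glock);
            c->global += c->local[cpu];
            pthread_mutex_unlock(&c->glock);
            c->local[cpu] = 0;
        }
        pthread_mutex_unlock(&c->llock[cpu]);
    }

    // get: return the global count (which may lag the true count by up
    // to NUMCPUS * threshold)
    int get(counter_t *c) {
        pthread_mutex_lock(&c->glock);
        int val = c->global;
        pthread_mutex_unlock(&c->glock);
        return val;
    }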