CIS 4307: Spinlocks and Semaphores

[TestAndSet], [Spinlocks] [Definition and initial implementation of Semaphores], [Comments], [Posix Semaphores], [Java Semaphores], [Nutt's Implementation], [Linux's Futexes]

TestAndSet

We consider a machine instruction that supports concurrency. Always be conscious of the hardware support for Operating Systems functions. There are a number of reasons for this support. It may be the only way to solve problem and, almost always, it allows solutions that are more efficient than purely software solutions. [A classic example of harware support is for interrupts: during the execution of each instruction there is a test to see if a request for interrupt has arrived. If it has and is of sufficient priority, the instruction is suspended and the interrupt handled. Interrupts are made possible by (hidden) polling!]

We can think of the TestAndSet instruction as a function implemented atomically in hardware. Here is this function:

	int TestAndSet(int *x){
	   register int temp = *x;
	   *x = 1;
	   return temp;
	}

The hardware can achieve this effect as follows: the instruction involves two Processor/Memory bus interactions:

Read from x into a register temp
Write one to x

no access to that memory location x must be allowed (this can be achieved, for instance, by maintaining bus mastery) in between those two interactions. Note that this is a slow operation because it must wait for the retrieving of a word from memory and the storing there. No use is made of caches. TestAndSet is an example of atomic instructions that support concurrency. Another frequently used instruction is the atomic Exchange operation:

         void Exchange(int *a, int *b) {
            int temp = *a;
            *a = *b;
            *b = temp;
         }

The most important of these atomic operations is CompareAndSwap, denoted CAS:

         boolean CAS(int *a, int old, int new) {
	    int temp = *a;
            if (temp == old) {
               *a = new;
               return true;
            } else 
               return false;
         }

which is used in lock-free and wait-free algorithms. Here is an example: consider the operation x++. In most systems this is the result of three operations

	int temp = x;
	temp++;
	x = temp;

Thus it is non safe under concurrency. Here is a lock-free implementation of the increment

	int temp = x;
	while (!CAS(&x, temp, temp+1)) {
		temp = x;
	}

If x is concurrently accessed by, say, 10 activities then the loop will be executed at most 10 times. [May be more if one of the activities is while(true)x++;]

Here is another example of use of the CompareAndSwap instruction. Suppose we have a pointer Node **head to the anchor of a singly linked list of nodes, where a Node is

    typedef struct node {Node * next; int value;} Node;

and Node *p is a pointer to a node we want to add to the head of the list. This is a lock-free add:

    while (1) {
	Node * q = *head;
	p->next = q;
	if (CAS(head, q, p)
	    break;
    }

SpinLocks

Spinlocks are abstract data types, or classes, that support a Busy-Wait solution for the Mutual Exclusion Problem. The operations on spinlocks are InitLock, Lock, and UnLock. If mutex is a spinlock that has been initialized using InitLock, then Critical Regions S1, S2, .., Sn can be protected from each other by preceding each region by a call to Lock and following it by a call to UnLock. The region Si becomes:

	Lock(mutex);
	Si;
	UnLock(mutex);

The pseudo code for implementing SpinLocks follows:


	typedef int SpinLock;

	void InitLock(SpinLock *L)
	{   *L = 0; }

	void Lock(SpinLock *L)
	{   while (TestAndSet(L)) 
		;
	}

	void UnLock(SpinLock *L)
	{   *L = 0; }

As you can see, the implementation is quite simple and with no system overhead since TestAndSet is a user mode instruction. It works because of the properties of the TestAndSet instruction. Notice that this solution involves Busy Waiting (polling).
Here is code for this kind of spinlock that works on the i386 architecture.
On the Intel Itanium hardware, with multiple execution elements, this implementation would not work because code such as

	lock(&s);
	n++;
	unlock(&s);
might be executed as
	lock(&s);
	*s = 0;
	n++;
because of hardware reordering of instructions. It is  necessary in 
the unlock operation to insert a barrier instruction. [Same for lock.]
In fact a barrier should be considered integral part of the lock and 
unlock operations.

Here is a possible implementation of spinlocks using the Exchange instruction:

        typedef int SpinLock;
        void InitLock(SpinLock *s) {
           *s = 0;
        }
        void Lock (SpinLock *s) {
           int L = 1;
           do {
              Exchange(&L, s);
           } while (L == 1);
        }
        void UnLock (SpinLock *s) {
           *s = 0;
        }

and here using the CompareAndSwap instruction:

        typedef int SpinLock;
        void InitLock(SpinLock *s) {
           *s = 0;
        }
        void Lock (SpinLock *s) {
           do {
           } until (CompareAndSwap(s, 0, 1));
        }
        void UnLock (SpinLock *s) {
           *s = 0;
        }

The beauty of the CompareAndSwap instruction is that it can be used to implement data structures such as linked lists without the need of spinlocks or semaphores (non-locking operations).

A big problem of the spinlock is that by repeatedly testandsetting a location in memory it can use a high percentage of the memory bus bandwidth, thus reducing the utilization of other procesors sharing the memory bus. This location that is repeatedly accessed is called a Hot Spot in the memory. A partial solution (in the case of cache-coherent shared memories) is to change the Lock operation as follows:

	void Lock(SpinLock *L)
	{   while (TestAndSet(L))
               while (*L == 1)
		  ;
	}

which uses less of the memory bandwidth since the TestAndSet instruction involves both a read and a write transaction on the memory while (*L==1) involves only a read operation, and this read operation is normally carried out on a cached copy. [The inner While Loop operates on the cached copy without affecting the shared memory. It assumes that the hardware supports some form (like snooping caches) of cache coherence algorithm which invalidates the cache copy when the shared memory copy is updated.]

There is no guaranty of Fairness for spinlocks. That is, if two processes P and Q are trying to enter a critical region protected by a locked spinlock mutex, then there is no guaranty that they will be serviced in FIFO order. Their order of entry will be random. More serious is the problem that the process P may use the spinlock within a loop and use the lock repeatedly while Q is unable to acquire it. [How can this happen?] This behavior could be avoided by giving a "Nice" behavior to the UnLock operation: when the Unlock is executed it is checked if another process is waiting and if yes the current process defers (gives up control of CPU) to it. Of course, in most situation this is an unwarranted precaution.

Spinlocks are dangerous to use in uniprocessor systems because of the Priority Inversion Problem. [A low priority process is in the critical region. A high priority process tries to enter the region. The low priority process cannot get the CPU which is held by the high priority process. The high priority process spins in the spinlock trying to enter the region being held by the low priority process. We have a deadlock. ]

Two possible solutions to the priority inversion problem:

Execute the critical regions with masked interrupts. Thus the current process cannot be preempted (in the case we are considering there is only one processor).
Determine the maximum priority among the processes that can execute these critical regions, then make sure that all processes execute the critical regions at this priority. Since now competing processes have the same priority, they cannot preempt each other.

A related problem is the formation of scheduling convoys. This occurs when a process ceases to run in the middle of a critical region [because of page fault, quantum expiration, external interrupt] and then all other processes sharing this region have to wait for its completion. A number of techniques are being introduced to deal with this problem, in particular Preemption-safe Locking and Non-blocking Algorithms. [Preemption-safe locking tries to avoid preemption during a critical region by having the kernel extend the quantum of a process in the critical region or by increasing its priority. Non-blocking algorithms, expecially for implementing shared counters and queues, make sure that if a process blocks within a critical region, it can be brought back to the beginning of the critical region and another process is allowed to enter the region.]

In general spinlocks with their busy waiting mechanism are desirable (as compared to non busy waiting mechanisms like semaphores) when the scheduling and context switching time are greater than the time required to execute the critical region, or when there are processors not required to do other tasks.

The efficient implementation of spinlocks is of great concern because spinlocks are the usual foundation for synchronization, expecially in multiprocessor systems. They are used everywhere and very frequently. As the use of multiprocessor systems increases so will the interest in efficient spinlocks and in alternative synchronization mechanisms

Definition and initial implementation of Semaphores

People use the term "Semaphore" to refer to a variety of synchronization mechanisms. Here by "Semaphore" we mean the "non-busy-waiting" analog of SpinLocks. We show a possible implementation for semaphores. For now assume that semaphore's operations are atomic.


	type semaphore = record
			    count : integer;
  1.			    q     : queue of processes; // usually FIFO
			 end record;
	procedure initsem (s : in out semaphore; n : integer) is
	begin
	   s.count := n; initialize s.q to the empty queue;
	end;
	procedure P(s : in out semaphore) is
	begin
  2.	   s.count := s.count - 1;
  3.	   if s.count < 0 then 
  4.	      Remove this process from Running, add it 
	              to s.q, and reschedule;
	end;
	procedure V(s : in out semaphore) is
	begin
  5.	   s.count := s.count + 1;
  6.	   if s.count <= 0 and s.q is not empty then
  7.	      Remove a process from s.q, // Or one may remove all and have them compete
		// again for access to the semaphore
		      add it to the Ready 
  8.		      Queue [, and reschedule]; // Rescheduling may or not 
						// take place immediately
	end;

Comments

On line 1. we talk of a "queue of processes". A process is usually represented by a process id, a pointer or an index that refers to a Process Control Block (PCB). A PCB contains the relevant information about a process state.
Lines 2., 3., and 4. form a critical region. The same is true for lines 5., 6., and 7.. As such they need to be enforced either by masking interrupts (in single processor systems) or by using spinlocks.
The P operation is also called Down or Wait. The V operation is also called Up or Signal.
Reschedule operates as follows: say A is the highest priority process in the Ready queue (you may assume that there is always one such process, at least a Null process).
- On Line 8., assuming that reschedule is executed: If A has higher priority than the currently running process B, move B to the Ready queue and A to Running. [This is a pre-emption - in some cases, for efficiency, the switch is delayed until the time-slice expires or the current process blocks.]
- If A does not have higher priority than the currently running process B, do nothing.
- If there is no running process, move A to Running.
In all cases, resume the (possibly new) running process.
On line 7., if we extract from the semaphore queue more than one process and move them to the ready queue, on a multiprocessor system they may all be made to run and all but one will block again on the semaphore. This is called the Thundering Herd Problem.
On line 8. if we do not execute rescheduling, and wait for the running process to give up voluntarily, or to exceed allocated time quantum, we gain in efficiency, but the latency for a higher priority waiting process will increase since it will have to wait for the lower priority (slower) process to give up. This creates an effect called the Convoy Problem because similar to what happens on a road where a slower vehicle holds back faster vehicles.
If initsem is called with the parameter n set to 0, we say that we have a Blocking Semaphore; if with 1, we say that we have a Boolean or Mutual Exclusion Semaphore; and if n is greater than 1, we say that we have a Counting Semaphore.
Boolean semaphores are the non-busy-waiting analogs of spinlocks.

Using semaphores we can achieve a number of different behaviors:

If S is a boolean semaphore, then we can implement related critical regions by prefixing them with the statement P(S); and affixing them with V(S);
If S is a counting semaphore initialized to n (say, 10), then up to n (10) different processes could be executing a region enclosed by P(S)..V(S) statements.
If S is a blocking semaphore, then we can make sure that a process completes execution of statement A before another process starts execution a statement B, by following A with the statement V(S) and preceding B with the statement P(S). This is for example one way of implementing the arc semantics in Precedence Graphs.

Semaphores do not suffer the same form of the Priority Inversion Problem as Spinlocks. You may see the kind of Priority Inversion Problem that can arise for Semaphores in the following note about the Mars Rover. We have a high priority process communicating with a low priority process using a small critical region. While the low priority process is in the critical region the high priority process waits in the semaphore protecting the region. If an intermediate, independent process starts running at this point, it preempts the low priority process and prevents the higher priority job to get on its important way. Priority Inheritance avoids the priority inversion problem. It works like this: if a low priority process is in a critical region and it is interrupted by a higher priority process, it will conitue the critical region at the priority of the other process. When the critical region is terminated the process reverts to its original priority.

Posix Semaphores

Posix has defined semaphores that have the behavior we have described above. However at present they work only within a process to synchronize threads, not across processes. We will not use them in this course preferring the use of mutex locks and condition variables (see the threads discussion in the notes).
The operations defined in Posix for semaphores are:

       #include < semaphore.h>

       int sem_init(sem_t *sem, int pshared, unsigned int value);
       int sem_wait(sem_t * sem);
       int sem_trywait(sem_t * sem);
       int sem_post(sem_t * sem);
       int sem_getvalue(sem_t * sem, int * sval);
       int sem_destroy(sem_t * sem);

where sem_init as second parameter should have 0 (meaning semaphore used for thread) and as third parameter should have the intended initial count for the semaphore (1 => mutual exclusion, 0 => blocking, > 1 => counting). sem_wait and sem_post are the V and P operations. sem_trywait always succeeds: if count was positive, it acts as sem_wait and returns 0, and if count is <=0 it does nothing and returns -1. sem_getvalue jest returns the current count. sem_destroy is the destructor for this object.

Java Semaphores

Though Java currently does predefines semaphores, preferring to allow users to use the more general synchronized method approach, the new version of Java (1.5) will predefine a Semaphore class with characteristics similar to what described in this note, but more general.

Nutt's Implementation

Here we show code similar to the code used by G.J.Nutt in his Operating Systems book. This code implements N semaphores for a multiprocessing system.

#define	 N   ..         /* The number of semaphores */

int      value[N];	/* Keeps the count for the semaphores   */
boolean  sem[N];	/* Spinlocks to protect critical region */
			/*   of semaphore */
boolean  hold[N];	/* Used to temporarily "enqueue" tasks  */
			/* waiting on sem */

void initsem(int s, m) {
	value[s] = m; sem[s] = FALSE; hold[s] = TRUE;}
void P(int s) {
	while (TS(sem[s]));	/* Lock */
	value[s]--;
	if (value[s] < 0) {
	   sem[s] = FALSE;	/* Unlock */
	   while(TS(hold[s])) yield(Q); /* Here the process enters */
			/* a new Critical Region where it gives    */
			/* control to Q, a scheduler. It is a loop */
			/* since we may repeatedly be given control*/
			/* back from yield and find a blocked semaphore.*/
	}else
	   sem[s] = FALSE;}	/* Unlock */
void V(int s){
	while (TS(sem[s]));	/* Lock */
	value[s]++;
	if (value[s] <= 0){	/* Somebody is waiting in hold to enter CR*/
	   while (!hold[s]);	/* We check if there is a concurrent V. */
				/* It is a loop */
				/* because we might otherwise have two or more */
				/* consecutive V operations all setting (see next */
				/* statement) hold[s] to false. */
	   hold[s] = FALSE;	/* Here we free a process waiting in P,*/
				/* namely on hold[s].  */
	}
	sem[s] = FALSE;}	/* Unlock */

Linux's Futexes

Spinlocks are in user space, thus no need to transfer to kernel, but with busy waiting. Semaphores are kernel objects, thus much more expensive. In recent version of Linus it is made use of an object, called futex that has the advantages of both and can be used as a foundation to create a number of other synchronization primitives.

The idea of futexes is fairly simple. A futex consists of an integer initialized to 1 in shared user memory, say ulock, and an associated kernel object, say klock. Then we can think of the P and V operations executed in user mode as follows:

    	P() {
1.	    int temp = atomicDecrement(ulock);
2.	    if (temp < 0) // contention: the critical region is in use
		systemCall to operate on klock, with appropriate behavior.
	}

	V() {
3.	    int temp = atomicIncrement(ulock);
4.	    if (temp <= 0) // contention: somebody is waiting for critical region
		systemCall to operate on klock, with appropriate behavior.
	}

Since most of the time there is no contention when trying to get into a critical region, this implementation will most of the time be executed totally in user mode.

Of course the big problem is that it is totally possible that while process1 is in the critical region process2 executes 1. and before it can block process1 may execute 3. and 4. Now unless we are careful process2 can go to go to sleep and not be awaken. A simple solution is to have as klock a blocking semaphore. Now the system call in 2. is replaced by P(klock) and the system call in 4. is replaced by V(klock). If it happens that 4. is executed before 2., then process2 will not block, and if 2.is executed before 4. process2 will block to be awaken by 4.

In reality futexes are much more complex and harder to use than indicated above. But the description above gives some idea on the issues considered and may motivate you to read more on this subject. The number of issues that need consideration, beyond just guaranties of correct synchronization, fairness, etc., are recoverability [if a process terminates while holding a lock, how do we recognize that is the case and appropriately modify the lock state?] and security [how do we make sure that only the processes that have the right access a shared lock/].