CIS 307: Logical (and physical) Clocks
Read Tanenbaum-vanSteen: 241-246, 252-261
Physical Clocks
Coordinated Universal Time (UTC) is the
standard used for time, based on an
atomic clock. There are radio stations that transmit a pulse each UTC second.
At receivers the accuracy is about 10ms, to account for errors in the transmitter and
variability in atmospheric conditions. Some computers are equipped with UTC receivers and they can
transmit time information on the internet to other computers.
Given a computer whose local physical clock has drift d (i.e., it will
have a maximum error of d after one second), if we want the error
between the clocks of two machines (i.e. the difference, called skew,
between the hardware clocks of the two systems at one instant) to be
never greater than e, we can do it by asking a UTC time server for the correct time
every e/2d seconds (the factor 2 accounts for the fact that the drift can slow down
or speed up the local physical clock). For example, with d = 10^-5 seconds per second
and a maximum tolerated skew of e = 10ms, each machine must resynchronize
every 0.01/(2*10^-5) = 500 seconds.
Cristian's algorithm implements this idea. Here are two potential problems with this algorithm, together with their solutions.
- The UTC time delivered to a computer may be smaller than the local time. We cannot reset the local time to the lower UTC value since then certain times would be repeated (this
would be a big problem for logs). We deal with this by appropriately slowing down the local
clock until it agrees with UTC.
- The UTC time that is received is inaccurate because of the propagation delay in the delivery of the signal, and because of the overhead in processing the signal. Since
we know the round trip time (RTT) to and from the server (measured with the local clock), we can adjust
the received UTC time appropriately by adding RTT/2 to it.
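Here is a minimal sketch of this adjustment in Python. The function
request_server_time is a hypothetical placeholder for the call that asks the
UTC time server for its current time; only the RTT/2 correction is the point.

import time

def cristian_sync(request_server_time):
    # Measure the round trip time on the local clock.
    t0 = time.monotonic()                 # just before the request goes out
    server_time = request_server_time()   # hypothetical call to the server
    t1 = time.monotonic()                 # just after the reply arrives
    rtt = t1 - t0
    # Assume the reply spent about half the round trip in transit, so the
    # server's clock has advanced by roughly RTT/2 since it read its time.
    return server_time + rtt / 2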
A number of other algorithms are used for computing synchronized physical clocks.
They are based on various forms of averaging between the times held by various computers and on the use of
a number of time servers. The most important time service available is based
on the Network Time Protocol (NTP) and involves a number of servers.
There are many uses for synchronized physical clocks and for logical clocks. Among them:
- For physical clocks: at-most-once message delivery and clock-based cache consistency,
(see Tanenbaum, pg 132)
- For logical clocks: scheduling, deadlock prevention, and
consistent checkpoints.
Logical Clocks
We would like to order the events occurring in a distributed system in such
a way as to reflect their possible causal connections. Certainly if an event
A happens before an event B, then A cannot have been caused by B. In this
situation we cannot say that A is the direct cause of B,
but we cannot exclude that A might have influenced B.
We want to characterize this "happened-before" relation on
events. We will consider only two kinds of events, the sending of a message
and the receiving of a message.
- If events e1 and e2 occur in the same system and e1 occurs before e2 (this
is easy to determine within a single system), then
e1 happened-before e2, written e1 -> e2
- If event e1 is the sending of a message and e2 is the receiving of that
message, then e1 happened-before e2
- If e1 -> e2 and e2 -> e3 then e1 -> e3
The -> relation is a partial order. Given events e1 and e2, it need not be
the case that they are the same event or that one happened-before the other.
These events may be unrelated.
Events that are not ordered by happened-before are said to be
concurrent.
This characterization of happened-before is unsatisfactory since, given two
events, it is not immediate to determine whether one event happened before
the other or whether they are concurrent (think of the time it takes to
evaluate a transitive relation). We would like a clock C that, applied to an
event, returns a number so that the following holds:
(1) If e1 -> e2 then C(e1) < C(e2)
We can define one such C, a logical clock (due to Lamport), as
follows:
- On each computer i we start with a clock Li set at 0
- If e is the sending of a message, then increment the local clock Li,
set C(e) to Li and timestamp the message with Li
- If e is the receiving at node i of a message with timestamp t, then set
the local clock and C(e) to max{t, Li}+1
[We could have started the local clocks at any common value greater than
or equal to 0 and used as increments, instead of 1, any value greater
than 0. The resulting time would still satisfy (1).]
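As a concrete sketch, here is one way to code these rules in Python for a
single node (the names LamportClock, send_event, receive_event are ours,
not from the text):

class LamportClock:
    def __init__(self):
        self.time = 0                     # local clock Li starts at 0

    def send_event(self):
        # Increment the local clock; the returned value is C(e) and is
        # also the timestamp carried by the outgoing message.
        self.time += 1
        return self.time

    def receive_event(self, t):
        # For a message stamped t, set the local clock and C(e)
        # to max{t, Li} + 1.
        self.time = max(t, self.time) + 1
        return self.time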
Here is an example showing events a,b,c,d,e,f,g,h,i,j and the corresponding
local clocks:
Since the method we have chosen allows different events (on different machines)
to have the same logical clock value, we extend the clock value with a number
representing the machine where the event occurred [i, C(e)] and order these
pairs according to the rule:
[i,C(e1)] < [j,C(e2)] iff C(e1) < C(e2) or [C(e1)=C(e2) and i<j]
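A small sketch of this comparison in Python, representing a tag as the
pair (i, C(e)):

def tag_less(tag1, tag2):
    # [i, C(e1)] < [j, C(e2)] iff C(e1) < C(e2), or C(e1) = C(e2) and i < j
    (i, c1), (j, c2) = tag1, tag2
    return c1 < c2 or (c1 == c2 and i < j)

assert tag_less((2, 3), (1, 4))   # smaller clock value wins
assert tag_less((1, 3), (2, 3))   # equal clocks: smaller machine number wins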
Here is an example showing the same events but now tagged with numbers that
totally order them:
Now that events are totally ordered, we could, for instance,
maintain separate logs at each system, and then later combine the
logs in a unique way. Also, we can use these timestamps to totally
order multicast messages (i.e. all systems receive the multicasts in
the same order), as we will see later.
A problem of the logical clocks considered above is that they do not represent
faithfully the happened-before relation. Namely, while it is true that (1) holds,
it is not true that
(2) If C(e1) < C(e2) then e1 -> e2
It would be nice to have logical clocks that would allow us to answer
precisely questions of concurrency and dependency. We can do it by using
clock values that are vectors of numbers, one for each computer system,
and then comparing vector values u and v with the rule
u < v iff ((for all components i, u[i] <= v[i]) and
(there is a j such that u[j] < v[j]))
Here are the rules on vector clocks in a system with n computers:
- Each computer starts with a local clock set at [0,0,..,0]
- When on computer i there is a sending event, increment the ith component
of the clock by 1 leaving other components unchanged, then tag both the event
and the message with this value
- When on computer i there is a receiving event,
form a new local clock value taking the componentwise maximum of the local
clock and the time stamp on the arriving message. Increment the ith component by 1.
Finally tag the event with this value.
With these rules, in the vector clock v = V(e) of an event e at machine i, v[i] denotes the
number of events that occurred at machine i up to and including e, and for j != i, v[j] denotes the
number of events that occurred at machine j that are also known at machine i.
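Here is a sketch of these rules in Python, together with the comparison rule
defined above (class and function names are ours):

class VectorClock:
    def __init__(self, i, n):
        self.i = i                        # this machine's index
        self.v = [0] * n                  # local clock starts at [0,..,0]

    def send_event(self):
        # Increment our own component; the result tags both the event
        # and the outgoing message.
        self.v[self.i] += 1
        return list(self.v)

    def receive_event(self, stamp):
        # Componentwise maximum of local clock and message stamp,
        # then count this receive event in our own component.
        self.v = [max(a, b) for a, b in zip(self.v, stamp)]
        self.v[self.i] += 1
        return list(self.v)

def vc_less(u, v):
    # u < v iff u[i] <= v[i] for all i, and u[j] < v[j] for some j
    return all(a <= b for a, b in zip(u, v)) and \
           any(a < b for a, b in zip(u, v))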
Here is our old example using vector clock values.
With vector clocks we have the property
e -> f if and only if V(e) < V(f)
Of course nothing is perfect. Though vector clocks allow us to represent the
happened-before relation between events faithfully, they result in
big timestamps and require that we know in advance the number and identities of the sites involved.
Vector clocks can be totally ordered by ordering events with
concurrent vector clocks on the basis of the machine where they occur.
All logical clocks, whether scalar or vector, suffer from a basic problem:
though they represent that an event occurred (or not) before another, they do not
represent by how much, that is, there is no measure of elapsed
time. In particular we cannot use logical clocks for timeouts; for those we need
a wallclock (i.e. an old fashioned clock, possibly synchronized).
Another problem of logical clocks, even vector clocks, is that they cannot account
for information exchanged outside of the computer systems, say because user A at system 1
receives output from the system and tells that information to user B at system 2, who then
enters that information at system 2.
Messages with FIFO Property and the Causal Ordering of messages
If we send two datagrams from host n1 to host n2 using the IP or UDP protocol
we are not guaranteed that they will be received in the
order they were sent. If instead we send data using the TCP protocol we
are guaranteed that the data will be received, and received in the order it was sent.
Let's use the following notation,
given a message m, SEND(m) is the event of sending the
message m, and RECEIVE(m) is the event of receiving the message m.
We say that a messaging system satisfies the
FIFO property if: when m1 and m2 are messages sent from the same source S
to the same destination D and we
have SEND(m1) -> SEND(m2), then RECEIVE(m1) -> RECEIVE(m2). When this
property is satisfied we have a natural (total) ordering for the messages from S to D.
The FIFO property is normally used in the context of multicast messages, where we want to be sure
that messages sent from the same source are delivered everywhere in the same order.
We generalize FIFO ordering as follows:
Causal Ordering of Messages: if SEND(m1) -> SEND(m2) then
!(RECEIVE(m2) -> RECEIVE(m1))
In going from the FIFO to the causal ordering we have dropped the
requirements that the messages come from the same source and go to the same destination.
Note that the situation shown below in the diagram (A)
is forbidden in a messaging system with the FIFO property,
and the situation in diagram (B) is forbidden in a messaging system
with the causal property, while the situation in diagram (C) satisfies the causal ordering.
TCP assures the FIFO property on a single connection, but does not guarantee the causal property
across multiple connections. Diagram (B) above is quite possible when using TCP.
Here is another time diagram, which we call the lost client diagram.
Interpret the message m1 = (a,j) to mean "Give responsibility for client x to machine M2"
Interpret the message m2 = (b,c) to mean "Who is responsible for x?"
Interpret the message m3 = (d,e) to mean "Machine M2 is responsible for x"
Interpret the message m4 = (f,g) to mean "Connect me to x"
Interpret the message m5 = (h,i) to mean "I don't know x"
Thus at event i machine M3 is in a state of confusion about the whereabouts of x.
This is clearly an undesirable situation.
Here are the vector clocks for the various events:
a = [1,0,0] b = [0,0,1] c = [2,0,1] d = [3,0,1] e = [3,0,2]
f = [3,0,3] g = [3,1,3] h = [3,2,3] i = [3,2,4] j = [3,3,3]
The lost client diagram does not satisfy the causal property:
though a = SEND(m1) -> d = SEND(m3) we have e = RECEIVE(m3) -> j = RECEIVE(m1).
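We can check both claims mechanically with the vector comparison rule
(vc_less is as sketched earlier):

def vc_less(u, v):
    return all(a <= b for a, b in zip(u, v)) and \
           any(a < b for a, b in zip(u, v))

a, d = [1, 0, 0], [3, 0, 1]   # a = SEND(m1), d = SEND(m3)
e, j = [3, 0, 2], [3, 3, 3]   # e = RECEIVE(m3), j = RECEIVE(m1)
assert vc_less(a, d)          # the sends are causally ordered ...
assert vc_less(e, j)          # ... but the receptions are inverted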
Making a multicast messaging system satisfy the causal property
Algorithms are available for converting a multicast messaging system
without the causal property into one where the property is satisfied.
One proceeds as follows.
The vector clock is computed to take into account only the sending of messages, not their reception
(i.e. the local component of the vector clock is not incremented when a message is received).
Thus the vector clock V on machine k will indicate at position k the number of messages sent by
machine k, and at every other position j the number of messages from j that have been
delivered at k.
When node k with local vector clock v receives from node j a message m
with vector timestamp u, the message is delivered to the application when the
following Delivery Rule holds:
- u[j] = v[j]+1, i.e. this message represents the first event at
j occurring after the last event of j known at k. Upon delivery the local clock v is updated
at position j to u[j].
- For all i != j, u[i] <= v[i], that is, node k already knows at least as much
about events at other nodes
as j knew when it sent the message m. The components v[i], i != j, are not updated.
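Here is a sketch of the delivery rule in Python, assuming vector clocks that
count only sends as just described (the function names and the deliver
callback are ours):

def deliverable(v, u, j):
    # Deliver a message from node j stamped u, given local clock v, iff
    # u[j] = v[j] + 1 and u[i] <= v[i] for every i != j.
    return u[j] == v[j] + 1 and \
           all(u[i] <= v[i] for i in range(len(v)) if i != j)

def on_receive(v, queue, u, j, deliver):
    # Buffer the message, then deliver everything that has become
    # deliverable; v is the local vector clock, updated in place.
    queue.append((u, j))
    progress = True
    while progress:
        progress = False
        for stamp, sender in list(queue):
            if deliverable(v, stamp, sender):
                queue.remove((stamp, sender))
                v[sender] = stamp[sender]   # only position j is updated
                deliver(stamp, sender)
                progress = True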
In diagram (B) above we see that the message
at event f on machine 3 has timestamp [1,0,0] while the last event e
on machine 3 had timestamp [2,1,0]. Thus we know that the message received
at f was sent before the sending of the message received at e.
This does not help us directly since by now the message at e has been
accepted. But if we require that all messages be multicast to all
participants, then each machine
will be able to detect if a message is received out of order, as shown
in the following graph.
Now the message at e is received with timestamp [2,0,0] while the local
clock is [0,0,0]; thus we know that an event on machine 1 that
preceded the sending of this message has not yet been received on machine 3.
Consequently we can postpone at machine 3 the delivery of the e message
until after the f event has occurred.
Causal ordering of multicast messages guarantees that messages sent by the same system will be
delivered everywhere in the order they were sent. More generally, if SEND(m1) -> SEND(m2)
then m1 will be delivered everywhere before m2. But if SEND(m1) and SEND(m2) are concurrent,
then there is no constraint on how these messages are delivered, so at one system m1 may
arrive first, at another m2 may arrive first.
Here is an example from the textbook: a database with information on bank accounts is replicated in
SF and NY. A teller in SF adds $100 to an account currently holding $1000. A teller in NY simultaneously
adds 1% interest to that account. Depending on the order in which these updates are done the result
is $1110 or $1111. [So NY may have $1110 and SF $1111, or vice versa. And both results
satisfy the causal ordering constraint, since the two SENDs are concurrent.] Certainly
we cannot live with a system where the updates can be done in different
orders at the replicated sites. The problem can be solved with totally ordered
multicast: messages are delivered to all sites and the delivery order is guaranteed to be the
same at all sites.
Totally ordered multicast can be implemented as follows: each site is connected to all others with
1-1 links. Each link has two properties: 1) it is reliable, 2) messages are delivered in the order
they were sent (FIFO property). For instance TCP connections have these properties.
[Alternatively the sites may use a form of multicast that satisfies those same two
properties.]
The sites use
totally ordered logical clocks (say, vector clocks + machine order to break ties). The sites keep queues.
- Each message, timestamped, is sent to all sites (multicast), including self.
- When a message is received, an acknowledgement, timestamped with the same time
as the message, is sent to all sites, including
self, and the message is placed in the local queue.
- A message that is in the queue and has been acknowledged by all the sites is said to be
ready. A ready message is extracted from the queue when its timestamp is not later than
that of any non-ready message in the queue. Extracted messages are then applied in the
order of their clocks.
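The queue logic can be sketched in Python as follows; timestamps are
(clock, site) pairs so that Python's tuple comparison coincides with the
total order, and the class and method names are ours:

import heapq

class TotalOrderQueue:
    def __init__(self, n_sites):
        self.n = n_sites       # every site acknowledges every message
        self.heap = []         # pending (timestamp, message) pairs
        self.acks = {}         # timestamp -> number of acks received

    def on_message(self, ts, msg):
        # Queue the message; the caller also multicasts an ack carrying
        # the same timestamp ts to all sites, including self.
        heapq.heappush(self.heap, (ts, msg))

    def on_ack(self, ts):
        self.acks[ts] = self.acks.get(ts, 0) + 1

    def pop_ready(self):
        # The head of the heap has the smallest timestamp; once it has
        # been acknowledged by all n sites, the FIFO links guarantee that
        # no message with a smaller timestamp can still arrive.
        out = []
        while self.heap and self.acks.get(self.heap[0][0], 0) == self.n:
            out.append(heapq.heappop(self.heap)[1])
        return out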
Does this algorithm work? Let's first observe that when a message m becomes ready at a site S,
m has been received by all the sites, since each of them has issued an acknowledgement for m
that was received at S. Second, if n is
a message originated at the same site as m and sent before m, then everywhere n is received before
m, and the
acknowledgements for n are received everywhere before the acknowledgements for m (this is so
because of the
properties assumed for the links); thus n will be ready before m and will be delivered to the
application (be
extracted from the queue) before m. So for each originating site we need to worry only about
one message from that site (the other messages from that site will be naturally delayed relative to
the first one).
If message n originates from a different site than m, the
argument is harder to make. It is possible that m and n are received in different orders at
different sites. But it is certain that, before the final decision to remove
either m or n from the queue, both m and n and an acknowledgement of either m or n will have been
received at each site,
thus allowing the algorithm to compare clocks and possibly delay dequeueing in order to
respect the clock total order.
That is, in all cases, if n precedes m in the total order then n will become ready before m. Thus
messages are delivered to the application in their clock order.
The following diagram illustrates the situation. Machine M2 sends message m and machine M3 sends
message n. To simplify the diagram we show only the messages delivered to M1 and M4. The
acknowledgements are indicated with a broken line and a primed name.
If m and n are delivered in different orders at M1 and M4, m before n at M1 and n before m at M4,
then the delivery of m' at M1 will follow the delivery of n' at M1. If n follows m in the total order
of messages, then n will be delayed because M1 will know about m. Similarly, when m becomes ready at M4
it will know about n, and m will be delayed if it follows n in the total order of messages.
In the specific ordering displayed above, at M1 n will be delayed when n' is received until m'
is received, at which point both m and n are ready and m will be acted on before n. At M4
instead, when m' is received m
becomes ready and can be applied.
In conclusion messages are applied everywhere in the order established by their totally ordered
vector clocks.
You may find it interesting to see in the following diagram
all the messages generated for a single message m originated at M2.
The message m becomes ready at M1, .., M4 respectively at a, .., d.
Notice that if there are n participants, then for each original message we will need
n-1 multicast messages
or (n-1)*(n-1) point-to-point messages!
We can achieve the same effect of delivering multicast messages in the
same order at all sites at the price of using a centralized controller.
Now all updates are sent to a single controller, and that controller
multicasts them to all replicated sites. Of course this again requires
communication channels that are reliable and have the FIFO property.
Consistent Cuts
Also important is the concept of a Consistent Cut, that is, a cut
across the timelines of the computers in the distributed system such that there
is no event a at computer i that is after the cut and event b at
computer j that is before the cut, yet a happened-before b.
For example in the diagram below, the first, second, and third cuts are
consistent, while the fourth is not.
Thus in a consistent cut we forbid the situation where we see an effect (the arrival of a message)
without seeing its cause (the sending of the message).
Note that in a consistent cut it is possible to have the sending of a message
before the cut, but the receiving of that message after the cut. That means that
when considering consistent cuts we also have to worry about, and save, the
'in transit' messages, i.e. messages sent before the cut but received after the cut.
A consistent cut corresponds to a state
of the distributed system which can be explained on the basis of the events that
occurred up to that time: there is no evidence in one system of an effect whose
cause in another system has not yet been seen. This is necessary for
reasonable debugging and for all situations where we want to be able
to observe the evolution of the state of a distributed system. And, most
importantly, for creating checkpoints of distributed computations.
If in a transaction we transfer $100 from account A
to account B, there is a time when the $100 are out of A but not yet
in B (or vice versa). The money is like "in transit", or, to use the
preferred term, "in the [communication] channel". When we look for
consistent cuts (or states), we look for situations where either
nothing is in the channel, or where we know what is in the channel.
Chandy and Lamport have an algorithm for computing consistent global
states (also called snapshots).
- A system decides to start a
snapshot: it saves its state and sends a Marker out through its outgoing
links. It will also save the messages it receives on an incoming link until it receives a marker on
that link.
- When a system receives a marker for the first time, it saves its
state and sends the marker out through its outgoing links. It will also
save the messages it receives on an incoming link until it receives a marker on
that link.
- The algorithm terminates when each node has received markers
through all its incoming links. The global state consists of the states saved at the
nodes plus the messages stored by the nodes.
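A minimal sketch of one node's side of the algorithm in Python; send_on and
record_local_state are assumed helpers supplied by the surrounding system,
not part of the algorithm itself:

class SnapshotNode:
    MARKER = "MARKER"

    def __init__(self, in_links, out_links, send_on, record_local_state):
        self.in_links = list(in_links)              # incoming link ids
        self.out_links = list(out_links)            # outgoing link ids
        self.send_on = send_on                      # assumed helper
        self.record_local_state = record_local_state  # assumed helper
        self.saved_state = None                     # local state once saved
        self.channel = {l: [] for l in in_links}    # recorded in-transit msgs
        self.recording = set()                      # links still recording

    def start_snapshot(self):
        # Save the local state and send a marker on every outgoing link.
        self.saved_state = self.record_local_state()
        self.recording = set(self.in_links)
        for l in self.out_links:
            self.send_on(l, self.MARKER)

    def on_message(self, link, msg):
        if msg == self.MARKER:
            if self.saved_state is None:    # first marker seen
                self.start_snapshot()
            self.recording.discard(link)    # this link's channel is now fixed
        elif link in self.recording:
            self.channel[link].append(msg)  # message was in the channel
        # The local part of the snapshot is complete when self.recording is
        # empty; saved_state plus self.channel is this node's contribution.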
The Chandy-Lamport algorithm allows each node in a distributed system to
collect information locally in a way that guarantees that the totality of
the information collected represents a consistent global state. But it
does not say how to deliver this total information to each (or some) node.
This is the Global State Collection Problem.
It has the following solution:
int i;       // The identity of the current node
State s[i];  // The information collected at node i by Chandy-Lamport
Set V[i];    // Initially V[i] = {s[i]}; the info collected at i
Set W[i];    // Initially empty; the info already sent from i
while (!terminationCondition) {
    if (V[i] != W[i]) {
        send V[i] - W[i] on all outgoing arcs;
        W[i] = V[i];    // remember what has been sent, so it is not resent
    }
    if (an arc from j to i contains a message M) {
        V[i] = V[i] union M;
    }
}
Terminate when V[i]==W[i] for all i and all arcs from any i to any j
are empty. But how can we determine if these conditions are true?
We need a termination detection algorithm.
One such algorithm is the Dijkstra-Scholten Termination Detection Algorithm.
At each node k (we assume that k=0 is the node that starts the algorithm and that it has no
predecessor) the following algorithm is run:
Variables:
C: integer; // Represents the number of messages received by node
// and not acknowledged
D: integer; // Represents the number of messages sent by node
// and not acknowledged
M: {signal, ack}; // value of a message sent on a link
sender: integer; // The sender of a message
parent: integer; // The identity of a node - used to build a
// spanning tree
begin
    C = D = 0;
    parent = k;
    if (k == 0) {
        C = 1;
        for each successor {
            send signal message to that successor;
            D++;
        }
    }
    while (true) {
        if (there is a message on a link) {
            set M to the message and sender to its sender;
            if (M == signal) {
                if (C == 0) {          // first signal: join the tree
                    C = 1;
                    parent = sender;
                    for each successor {
                        send signal message to that successor;
                        D++;
                    }
                } else {               // C == 1: already in the tree
                    send ack message to sender;
                }
            } else {                   // the message is an ack
                D--;
            }
        }
        if (C == 1 && D == 0) {        // the subtree rooted here is quiet
            if (k == 0)
                the algorithm has terminated;
            else
                send ack message to parent;
            C = 0;
            parent = k;
        }
    }
end
Here is an example:
We show a possible evolution of the algorithm. In the following table we state, for each node,
the values of parent, C, and D.
  0   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8
------+-------+-------+-------+-------+-------+-------+-------+------
0,0,0 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,1,1 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,1,1 | 0,1,1 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,1,1 | 0,1,1 | 1,1,2 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 4,0,0 | 5,0,0 | 6,0,0 | 2,1,2 | 8,0,0
0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 3,1,1 | 5,0,0 | 6,0,0 | 2,1,1 | 7,1,1
0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 3,1,1 | 4,1,1 | 8,1,2 | 2,1,1 | 7,1,1
0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 3,1,1 | 5,0,0 | 6,0,0 | 2,1,1 | 7,1,0
0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 4,0,0 | 5,0,0 | 6,0,0 | 2,1,0 | 8,0,0
0,1,1 | 0,1,1 | 1,1,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,1,1 | 0,1,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,1,0 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
0,0,0 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0
If one gives up on a fully distributed approach and allows for a
checkpoint coordinator (or two, for reliability), then the complexity of
the solution is
dramatically reduced: each node simply reports its state and its in-transit
messages to the coordinator(s).