We would like to order the events occurring in a distributed system in such a way as to reflect their possible causal connections. Certainly if an event A happens before an event B, then A cannot have been caused by B. We cannot say that A is the direct cause of B, but we cannot exclude that A might have influenced B. We want to characterize this "happened-before" relation on events. We will consider only two kinds of events, the sending of a message and the receiving of a message.
The -> relation is a partial order. Given two events e1 and e2, it is not necessarily the case that they are the same event or that one happened-before the other: the events may be unrelated. Events that are not ordered by happened-before are said to be concurrent.
This characterization of happened-before is unsatisfactory: given two events, it is not immediate (think of the time it takes to evaluate a transitive relation) to determine whether one event happened before the other or whether they are concurrent. We would like a clock C that, applied to an event, returns a number so that the following holds:
Since the method we have chosen allows different events (on different machines) to have the same logical clock value, we extend the clock value with a number i identifying the machine where the event occurred, obtaining the pair [i, C(e)], and we order these pairs according to the rule:
Now that events are totally ordered, we could, for instance, maintain separate logs at each system and later combine the logs in a unique way. We can also use these timestamps to totally order multicast messages (i.e. all systems receive the multicasts in the same order), as we will see later.
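As a concrete illustration, here is a minimal sketch in Python (the class and method names are ours, not part of these notes) of scalar logical clocks, with the machine identifier used to break ties so that any two distinct events become comparable.

```python
# A minimal sketch of scalar (Lamport) logical clocks with machine-id tie-breaking.

class LamportClock:
    def __init__(self, machine_id):
        self.machine_id = machine_id
        self.time = 0

    def send_event(self):
        """Tick the clock and return the timestamp to attach to the message."""
        self.time += 1
        return (self.time, self.machine_id)

    def receive_event(self, msg_timestamp):
        """Advance past the sender's clock so that send -> receive is preserved."""
        msg_time, _ = msg_timestamp
        self.time = max(self.time, msg_time) + 1
        return (self.time, self.machine_id)

# Pairs (C(e), i) compare lexicographically: clock value first, machine id
# breaks ties, so any two distinct events are ordered.
m1, m2 = LamportClock(1), LamportClock(2)
a = m1.send_event()          # (1, 1)
b = m2.receive_event(a)      # (2, 2): ordered after the send
c = m2.send_event()          # (3, 2)
print(sorted([c, b, a]))     # total order: [(1, 1), (2, 2), (3, 2)]
```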
Here is our old example using vector clock values.
With vector clocks we have the property
All logical clocks, whether scalar or vector, suffer from a basic problem: though they can represent that one event occurred (or did not occur) before another, they do not represent by how much; that is, there is no measure of elapsed time. In particular we cannot use logical clocks for timeouts; for those we need instead a wallclock (i.e. an old fashioned clock, possibly synchronized).
Another problem of logical clocks, even vector clocks, is that they cannot account for information exchanged outside of the computer systems, say because user A at system 1 reads output from that system and tells the information to user B, who then enters that information at system 2.
We generalize FIFO ordering as follows:
TCP assures the FIFO property on a single connection, but it does not guarantee the causal property across multiple connections. Diagram (B) above is quite possible when using TCP.
Here is another time diagram, which we call the lost client diagram.
Interpret the messages as follows:

- m1 = (a,j): "Give responsibility for client x to machine M2"
- m2 = (b,c): "Who is responsible for x?"
- m3 = (d,e): "Machine M2 is responsible for x"
- m4 = (f,g): "Connect me to x"
- m5 = (h,i): "I don't know x"

Thus at i machine M3 is in a state of confusion about the whereabouts of x. This is clearly an undesirable situation. Here are the vector clocks for the various events:

a = [1,0,0], b = [0,0,1], c = [2,0,1], d = [3,0,1], e = [3,0,2], f = [3,0,3], g = [3,1,3], h = [3,2,3], i = [3,2,4], j = [3,3,3]

The lost client diagram does not satisfy the causal property: though a = SEND(m1) -> d = SEND(m3), we have e = RECEIVE(m3) -> j = RECEIVE(m1).
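These causal relations can be checked mechanically from the vector timestamps: e1 -> e2 exactly when e1's vector is less than or equal to e2's in every component and the two differ, and two events are concurrent when neither vector dominates the other. A small sketch in Python (the helper names are ours), applied to the timestamps of the lost client diagram:

```python
def happened_before(v1, v2):
    """v1 -> v2 iff v1 <= v2 componentwise and v1 != v2."""
    return all(x <= y for x, y in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return not happened_before(v1, v2) and not happened_before(v2, v1)

# Timestamps from the lost client diagram.
a, d, e, j = [1,0,0], [3,0,1], [3,0,2], [3,3,3]

print(happened_before(a, d))          # True:  a = SEND(m1) -> d = SEND(m3)
print(happened_before(e, j))          # True:  e = RECEIVE(m3) -> j = RECEIVE(m1)
print(concurrent([1,0,0], [0,0,1]))   # True:  a and b are concurrent
```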
Now the message received at e carries timestamp [2,0,0] while the local clock at machine 3 is [0,0,0]; thus we know that an event on machine 1 that preceded the sending of this message has not yet been seen at machine 3. Consequently machine 3 can postpone the delivery of the message received at e until after the f event has occurred.
Causal ordering of multicast messages guarantees that messages sent by the same system will be delivered everywhere in the order they were sent. More generally, if SEND(m1) -> SEND(m2) then m1 will be delivered everywhere before m2. But if SEND(m1) and SEND(m2) are concurrent, then there is no constraint on how these messages are delivered, so in one system m1 may arrive first while in another m2 arrives first.
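One common way to enforce causal delivery (a sketch, not the notes' code, under the usual convention that entry k of a message's vector timestamp counts the multicasts of machine k that causally precede it, including itself for the sender) is to buffer a message until it is the next one expected from its sender and everything it depends on has already been delivered, exactly as in the postponement described above:

```python
def deliverable(msg_vc, sender, local_vc):
    """Deliver m from `sender` only if it is the next multicast we expect from
    the sender and we have already delivered everything m depends on."""
    return (msg_vc[sender] == local_vc[sender] + 1 and
            all(msg_vc[k] <= local_vc[k]
                for k in range(len(local_vc)) if k != sender))

# Machine 3 (index 2) has seen nothing yet:
local = [0, 0, 0]
# A message from machine 1 (index 0) that depends on two earlier multicasts
# by machine 1 must wait; one that depends on nothing can be delivered.
print(deliverable([2, 0, 0], 0, local))   # False -> buffer it
print(deliverable([1, 0, 0], 0, local))   # True  -> deliver immediately
```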
Totally ordered multicasts can be implemented as follows: each site is connected to all others with 1-1 links. Each link has two properties: 1) it is reliable, and 2) messages are delivered in the order they were sent (the FIFO property). For instance, TCP connections have these properties. [Alternatively the sites may use a form of multicast that satisfies those same two properties.] The sites use totally ordered logical clocks (say, vector clocks with the machine identifier to break ties), and each site keeps a queue of the messages it has received.
Suppose m and n arrive in different orders at M1 and M4: m before n at M1, and n before m at M4; suppose also that at M1 the acknowledgement m' of m arrives after the acknowledgement n' of n. If n follows m in the total order of messages, then n will be delayed, because M1 already knows about m. Similarly, when m becomes ready at M4 it will know about n, and it would be delayed if it followed n in the total order of messages. In the specific ordering displayed above, at M1 n is delayed when n' is received until m' is received, at which point both m and n are ready and m is acted on before n. At M4, instead, when m' is received m becomes ready and can be applied.
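Here is a sketch in Python of one site's hold-back queue for this scheme (the names and details are ours, not the notes' code): every multicast and every acknowledgement carries the totally ordered timestamp of the original message, and a message is acted on only when it is at the head of the timestamp order and has been acknowledged by all sites.

```python
import heapq

class TotalOrderQueue:
    def __init__(self, n_sites):
        self.n_sites = n_sites
        self.queue = []            # heap of (timestamp, message)
        self.acks = {}             # timestamp -> set of sites that acknowledged

    def on_multicast(self, timestamp, message, sender):
        heapq.heappush(self.queue, (timestamp, message))
        # The sender's own multicast counts as its acknowledgement.
        self.acks.setdefault(timestamp, set()).add(sender)

    def on_ack(self, timestamp, sender):
        self.acks.setdefault(timestamp, set()).add(sender)

    def deliverable(self):
        """Yield messages in timestamp order once everyone has acknowledged
        the message at the head of the queue."""
        while self.queue:
            ts, msg = self.queue[0]
            if len(self.acks.get(ts, set())) < self.n_sites:
                break              # head not yet acknowledged by all sites
            heapq.heappop(self.queue)
            yield ts, msg

# m (timestamp (1, 2)) precedes n (timestamp (1, 3)) in the total order.
q = TotalOrderQueue(n_sites=4)
q.on_multicast((1, 3), "n", sender=3)          # n happens to arrive first
q.on_multicast((1, 2), "m", sender=2)
for site in range(4):
    q.on_ack((1, 2), site)
    q.on_ack((1, 3), site)
print(list(q.deliverable()))   # [((1, 2), 'm'), ((1, 3), 'n')] at every site
```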
You may find it interesting to see in the following diagram all the messages generated for a single message m originated at M2. The message m becomes ready at M1, .., M4 respectively at a, .., d. Notice that if there are n agents, then for each original message we will need n-1 multicast messages (the acknowledgements), or (n-1)*(n-1) point-to-point messages! But all these messages will have the same timestamp as the original m.
We can achieve the same effect of delivering multicast messages in the same order at all sites at the price of using a centralized controller. All updates are sent to a single controller, and that controller then multicasts them to all replicated sites. Of course this requires that the communication channels be reliable and have the FIFO property.
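A sketch of this centralized alternative (the class names are ours): the controller acts as a sequencer, stamping each update with a consecutive number, and every site applies updates in sequence-number order, so all replicas end up identical.

```python
from itertools import count

class Sequencer:
    """The centralized controller: stamps updates with consecutive numbers."""
    def __init__(self):
        self._next = count(1)

    def submit(self, update):
        return (next(self._next), update)   # multicast this pair to all sites

class ReplicaSite:
    """A replicated site: applies updates strictly in sequence-number order."""
    def __init__(self):
        self.expected = 1
        self.pending = {}
        self.log = []

    def on_multicast(self, seq, update):
        self.pending[seq] = update
        while self.expected in self.pending:
            self.log.append(self.pending.pop(self.expected))
            self.expected += 1

seq = Sequencer()
s1, s2 = ReplicaSite(), ReplicaSite()
a = seq.submit("x = 1")
b = seq.submit("x = 2")
# Even if a site happened to see the updates in a different order, applying
# them by sequence number keeps the replicas identical.
for site, order in ((s1, [a, b]), (s2, [b, a])):
    for n, u in order:
        site.on_multicast(n, u)
print(s1.log == s2.log)   # True: both sites applied the updates in the same order
```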
Thus in a consistent cut we forbid the situation where we see an effect (the arrival of a message)
without seeing its cause (the sending of the message).
Note that in a consistent cut it is possible to have the sending of a message before the cut but the receiving of that message after the cut. That means that when considering consistent cuts we also have to worry about, and save, the 'in transit' messages, i.e. messages sent before the cut but received after the cut.
A consistent cut corresponds to a state of the distributed system which can be explained on the basis of the events that occurred up to that time: there is no evidence in one system of an effect whose cause in another system has not yet been seen. This is necessary for reasonable debugging and for all situations where we want to be able to observe the evolution of the state of a distributed system. And, most importantly, to create checkpoints for distributed computations.
If in a transaction we transfer $100 from account A to account B, there is a time when the $100 is out of A but not yet in B (or vice versa). The money is "in transit" or, to use the preferred term, "in the [communication] channel". When we look for consistent cuts (or states), we look for situations where either nothing is in the channel, or we know what is in the channel.
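As an illustration, here is one possible way (the representation is ours, not the notes') to check a cut for consistency and to compute what is in the channel. A cut records how many events of each process it includes, and each message records the events at which it was sent and received.

```python
def consistent(cut, messages):
    """A cut is consistent if it never contains a receive without its send."""
    return all(not (recv_idx <= cut[recv_p] and send_idx > cut[send_p])
               for send_p, send_idx, recv_p, recv_idx in messages)

def in_channel(cut, messages):
    """Messages sent before the cut but not yet received: the channel state
    that must be saved together with a consistent cut."""
    return [m for m in messages
            if m[1] <= cut[m[0]] and m[3] > cut[m[2]]]

# The $100 transfer: process 0 sends the money on its event 2,
# process 1 receives it on its event 3.
transfer = [(0, 2, 1, 3)]
print(consistent({0: 2, 1: 2}, transfer))   # True: the $100 is in the channel
print(in_channel({0: 2, 1: 2}, transfer))   # [(0, 2, 1, 3)]
print(consistent({0: 1, 1: 3}, transfer))   # False: a receive without its send
```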
Chandy and Lamport have an algorithm for computing consistent global states (also called snapshots).
The Chandy-Lamport algorithm allows each node in a distributed system to collect information locally in a way that guarantees that the totality of the information collected represents a consistent global state. But it does not say how to deliver this total information to each (or some) node. This is the Global State Collection Problem. It has the following solution:
int i;       // The identity of the current node
State s[i];  // The information collected at node i in Chandy-Lamport
Set V[i];    // initially V[i] = {s[i]}; it is the info collected at i
Set W[i];    // initially W[i] is empty; it is the info already sent from i
while (!terminationCondition) {
    if (V[i] != W[i]) {
        send V[i]-W[i] on all outgoing arcs;
        W[i] = V[i];    // remember what has already been sent
    }
    if (arc from j to i contains message M) {
        V[i] = V[i] union M;
    }
}
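To make the loop concrete, here is a round-based, single-process simulation in Python (an illustrative sketch, not the notes' code). It stops when every V[i] equals W[i] and all channels are empty, which is exactly the termination condition discussed next; the simulation can detect this centrally, while a real distributed system cannot.

```python
def collect(graph, local_state):
    """graph maps each node to its outgoing arcs (assumed strongly connected);
    local_state[i] is the information node i collected in Chandy-Lamport."""
    V = {i: {local_state[i]} for i in graph}   # info collected at i
    W = {i: set() for i in graph}              # info already sent from i
    channel = {(i, j): [] for i in graph for j in graph[i]}

    while True:
        progress = False
        for i in graph:                        # send phase
            diff = V[i] - W[i]
            if diff:
                for j in graph[i]:
                    channel[(i, j)].append(set(diff))
                W[i] |= diff
                progress = True
        for (i, j), msgs in channel.items():   # receive phase
            while msgs:
                V[j] |= msgs.pop()
                progress = True
        if not progress:                       # V[i] == W[i] everywhere
            return V                           # and all channels are empty

# A strongly connected ring: every node ends up with every local state.
ring = {0: [1], 1: [2], 2: [0]}
print(collect(ring, {0: "s0", 1: "s1", 2: "s2"}))
```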
Terminate when V[i]==W[i] for all i and all arcs from any i to any j
are empty. But how can we determine if these conditions are true?
We need a termination detection algorithm.
One such algorithm is the Dijkstra-Scholten Termination Detection Algorithm. At each node k (we assume that k=0 is the node that starts the algorithm and that it has no predecessor) the following algorithm is run:
Variables:
C: integer; // Represents the number of messages received by node
// and not acknowledged
D: integer; // Represents the number of messages sent by node
// and not acknowledged
M: {signal, ack}; // value of a message sent on a link
sender: integer; // The sender of a message
parent: integer; // The identity of a node - used to build a
// spanning tree
begin
    C = D = 0;
    parent = k;
    if (k == 0) {
        C = 1;
        for each successor {
            send signal message to that successor;
            D++;
        }
    }
    while (true) {
        if (there is a message on a link) {
            set M to the message and sender to its sender;
            if (M == signal) {
                if (C == 0) {
                    C = 1;
                    parent = sender;
                    for each successor {
                        send signal message to that successor;
                        D++;
                    }
                } else if (C == 1) {
                    send ack message to sender;
                }
            } else {   // The message is an ack
                D--;
            }
        }
        if (C == 1 && D == 0) {
            if (k == 0) {
                C = 0;
                The algorithm has terminated
            } else {
                send ack message to parent;
                C = 0;
                parent = k;
            }
        }
    }
end
Here is an example:
We show a possible evolution of the algorithm. In the following table we give, for each node, the values of parent, C, and D (in that order).
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 0,0,0 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,1,1 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,1,1 | 0,1,1 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,1,1 | 0,1,1 | 1,1,2 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 4,0,0 | 5,0,0 | 6,0,0 | 2,1,2 | 8,0,0 |
| 0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 3,1,1 | 5,0,0 | 6,0,0 | 2,1,1 | 7,1,1 |
| 0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 3,1,1 | 4,1,1 | 8,1,2 | 2,1,1 | 7,1,1 |
| 0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 3,1,1 | 5,0,0 | 6,0,0 | 2,1,1 | 7,1,0 |
| 0,1,1 | 0,1,1 | 1,1,2 | 2,1,1 | 4,0,0 | 5,0,0 | 6,0,0 | 2,1,0 | 8,0,0 |
| 0,1,1 | 0,1,1 | 1,1,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,1,1 | 0,1,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,1,0 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
| 0,0,0 | 1,0,0 | 2,0,0 | 3,0,0 | 4,0,0 | 5,0,0 | 6,0,0 | 7,0,0 | 8,0,0 |
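The following Python simulation runs the same algorithm end to end (an illustrative sketch; the topology below is our guess, chosen to be compatible with the D values in the table, since the diagram itself is not reproduced here). Signals flood out from node 0, acknowledgements flow back, and node 0 detects termination when its own count of unacknowledged signals drops to zero.

```python
from collections import deque

def dijkstra_scholten(successors, root=0):
    """Single-process, round-robin simulation of the pseudocode above."""
    C = {k: 0 for k in successors}      # signals received and not yet acked
    D = {k: 0 for k in successors}      # signals sent and not yet acked
    parent = {k: k for k in successors}
    inbox = {k: deque() for k in successors}
    tree = {}                           # parent used for each node's last ack

    # The root engages itself and signals all of its successors.
    C[root] = 1
    for s in successors[root]:
        inbox[s].append(("signal", root))
        D[root] += 1

    while True:
        for k in successors:            # process at most one message per node per pass
            if inbox[k]:
                kind, sender = inbox[k].popleft()
                if kind == "signal":
                    if C[k] == 0:       # not engaged: join the tree under the sender
                        C[k] = 1
                        parent[k] = sender
                        tree[k] = sender
                        for s in successors[k]:
                            inbox[s].append(("signal", k))
                            D[k] += 1
                    else:               # already engaged: acknowledge at once
                        inbox[sender].append(("ack", k))
                else:                   # an ack for a signal sent by k
                    D[k] -= 1
            if C[k] == 1 and D[k] == 0:     # nothing pending: detach
                if k == root:
                    return tree             # termination detected at the root
                inbox[parent[k]].append(("ack", k))
                C[k] = 0
                parent[k] = k

# A 9-node graph (hypothetical topology) in the spirit of the example above.
succ = {0: [1], 1: [2], 2: [3, 7], 3: [4], 4: [5], 5: [0],
        6: [2, 4], 7: [8, 1], 8: [6]}
print(dijkstra_scholten(succ))
```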
Given a computer with a local physical clock that has drift d (i.e., it will have a maximum error of d seconds after one second), if we want the error between the clocks of two machines (i.e. the difference, called skew, between the hardware clocks of the two systems at one instant) never to be greater than e, we can achieve it by asking a UTC time server for the correct time every e/(2d) seconds (the factor 2 accounts for the fact that the drift can slow down or speed up the local physical clock). Cristian's algorithm implements this idea. Here are two potential problems with this algorithm, together with their solutions.
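A small sketch (ours, not the notes' code) of the two ideas in this paragraph: how often to resynchronize for a maximum skew e given drift d, and the round-trip estimate of the server's time; `ask_time_server` stands for whatever request/reply mechanism is actually used and is purely a placeholder.

```python
import time

def resync_interval(max_skew_e, drift_d):
    """Resynchronize at least every e / (2d) seconds: two clocks drifting in
    opposite directions move apart at up to 2d seconds per second."""
    return max_skew_e / (2 * drift_d)

def cristian_estimate(ask_time_server):
    """Estimate the server's current time, compensating for network delay by
    assuming the reply travelled for about half of the round trip."""
    t0 = time.monotonic()
    server_time = ask_time_server()   # placeholder for the actual request
    t1 = time.monotonic()
    return server_time + (t1 - t0) / 2

# With a drift of 10^-5 s/s and a tolerated skew of 1 ms, resynchronize
# at least every 50 seconds.
print(resync_interval(0.001, 1e-5))   # 50.0
```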
There are an unlimited number of uses for synchronized physical clocks and for logical clocks. Among them,