CIS 307: Group Communication and Replication

Pages 99 - 115 in Tanenbaum - Distributed Operating Systems

We discuss two seemingly unrelated topics, Group Communication and Data and/or Service Replication. Group communication is often used to achieve Replication. We will not discuss how groups are created and maintained, nor will we worry about possible relations between groups.

In Group Communication, contrary to Point-to-Point Communication, we have one transmitter and many receivers: all the members of the group. A one-to-many message can be realized either as a sequence of point-to-point messages to the individual members or, where the network supports it, as a single multicast transmission.

The Ethernet and the IP protocol both support multicasting (an Ethernet address with 1 as its leading bit denotes a multicast group, and Class D IP addresses are reserved for multicast).
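
As a concrete illustration, a receiver can join an IP multicast group using the standard socket interface; the group address 224.1.1.1 and port 5007 below are arbitrary choices, and the sketch only shows a receiver joining a group and reading one datagram from it:

  import socket, struct

  MCAST_GRP = '224.1.1.1'      # an arbitrary Class D (multicast) address
  MCAST_PORT = 5007            # an arbitrary UDP port

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
  sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
  sock.bind(('', MCAST_PORT))
  # Ask the kernel to join the multicast group on all interfaces.
  mreq = struct.pack('4sl', socket.inet_aton(MCAST_GRP), socket.INADDR_ANY)
  sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

  data, sender = sock.recvfrom(1024)   # blocks until a datagram sent to the group arrives
  print(sender, data)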

Properties and relations among group messages

A group multicast system is said to be reliable if a multicast message is received by all nodes in the group or by none. [Some call such a system atomic, while others reserve the word atomic for more restrictive systems (reliable with totally ordered messages).]

A unicast message m involves two events, the sending of the message, s(m), and the reception of the message r(m). In the case of a multicast message m, we have the sending of the message s(m) and, for each node i in the group, the reception of the message at node i, r(m,i). Clearly s(m) -> r(m,i) for all nodes i in the multicast group. We say that a communication system enforces FIFO ordering for messages if for any two messages m1 and m2 sent by the same node we have:

   if s(m1) -> s(m2) then for all i in the group r(m1,i) -> r(m2,i)
In TCP, fragments of a message may be delivered out of order to the receiving system, but TCP at that system reorders the fragments and delivers them in the right order to the receiver. Thus TCP provides FIFO ordering of the fragments within a message. (And of course TCP is FIFO for successive messages.) TCP can do that since the fragments of the message are stamped with offset information.
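
More generally, FIFO ordering can be enforced at the receiving end by stamping each message with a per-sender sequence number and holding back messages that arrive too early. Here is a minimal sketch (the names and data structures are illustrative, not taken from any particular system):

  expected = {}   # sender -> sequence number of the next message to deliver
  held = {}       # sender -> {sequence number: message} that arrived too early

  def on_receive(sender, seq, msg):
      expected.setdefault(sender, 1)
      held.setdefault(sender, {})
      if seq < expected[sender]:
          return                      # duplicate: already delivered
      held[sender][seq] = msg
      # Deliver as many consecutive messages from this sender as possible.
      while expected[sender] in held[sender]:
          print("delivered:", held[sender].pop(expected[sender]))
          expected[sender] += 1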

We say that a communication system enforces the causal ordering of messages if for any two messages m1 and m2, possibly sent by different nodes (-> being the happened-before relation),

   if s(m1) -> s(m2) then for all i in the group r(m1,i) -> r(m2,i)
If a system enforces causal ordering then it also enforces FIFO ordering, but not vice versa. Consider, for example, a run in which one node sends m1 and later m3, and another node, after receiving m1, sends m2. The run is FIFO since all arrivals of m1 occur before the corresponding arrivals of m3; it is not causal since s(m1) happens before s(m2), yet at some node i we have r(m2,i) happening before r(m1,i).

Here is an example where causal ordering is not preserved: I mail a letter with a deposit to a bank; later I mail a letter with a check. The check gets cashed before my deposit is received by the bank. Clearly the mail system does not preserve the causal ordering of messages. We want to delay the delivery of messages that arrive out of order until all the causally preceding messages have been delivered. Here is an algorithm that enforces causal ordering:

  CBCAST Algorithm (ISIS - Birman + Joseph):
  We are given nodes M1, M2, .., Mn.
  Each node Mi maintains a vector [S1,S2,..,Sn] where Sj indicates that
  messages 1, 2, .., Sj have been received from node Mj.
  When node Mi sends a message, it will time stamp it with the vector
  [S1,..,Si+1,..,Sn].

  When node Mj with vector [S1,S2,..,Sn] receives a message from Mi
  with time stamp [T1,T2,..,Tn]

  1. We accept the message if
       (1) Ti = Si+1 and
       (2) for all k different from i, Tk <= Sk
     Where (1) means that this is the next message from Mi to reach Mj, and
     (2) means that the sender does not know more about other machines
     than the receiver.

  2. We hold and delay the message if
       (3) Ti > Si+1 or
       (4) Ti = Si+1 and there is a k such that Tk > Sk
     Where (3) means that there is an earlier pending message from Mi
     and (4) means the sender has received messages not yet delivered to the
     receiver.

  3. We discard the message if
       (5) Ti <= Si
     Where (5) means that we are dealing with a duplicate message.
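
Here is a minimal sketch in Python of the accept/hold/discard test above. The function name and the use of 0-based lists for the vectors are illustrative choices, not part of the ISIS implementation:

  def cbcast_action(S, T, i):
      # S = receiver's vector, T = timestamp on the incoming message,
      # i = index of the sending node.  Returns 'accept', 'delay' or 'discard'.
      if T[i] <= S[i]:
          return 'discard'   # (5) duplicate: already delivered
      if T[i] == S[i] + 1 and all(T[k] <= S[k] for k in range(len(S)) if k != i):
          return 'accept'    # (1) next message from Mi and (2) no causal gaps
      return 'delay'         # (3) or (4): an earlier message is still missing

  # Example: a receiver with S = [2, 1, 0] gets a message from node 0
  # stamped T = [3, 1, 0]; cbcast_action([2, 1, 0], [3, 1, 0], 0) == 'accept'.
  # On acceptance the receiver sets S[0] = 3 and re-examines any delayed
  # messages, since they may now have become deliverable.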

We say that a communication system enforces the total ordering of messages if any two messages m1 and m2 are delivered in the same order at all nodes. Note that this definition does not guarantee causal order, nor does total order follow from causal order: two messages m1 and m2 whose sendings are not ordered by the happened-before relation are trivially causally ordered, yet they are not totally ordered if some of the arrivals of m1 happen before the corresponding arrivals of m2 and some after.

Conversely, a run can be totally ordered but not causally ordered: (1) at every machine that receives both m1 and m2, all arrivals of m2 occur before the corresponding arrivals of m1 (total order); but (2) s(m1) happens before s(m2), yet r(m2,3) happens before r(m1,3) (no causal order).

One way of implementing total ordering of messages is to have a sequencer node. When a node wants to send a message, it first requests a sequence number from the sequencer and then multicasts the message timestamped with that number. Each node keeps track of the highest sequence number it has delivered. When a message is received, it is delayed if its sequence number is greater than 1 plus the local number, accepted if it is equal to 1 plus the local number, and discarded if it is less than or equal to the local number (a duplicate). The total ordering is also a causal ordering if obtaining the sequence number and sending the message stamped with that number is done as an indivisible action. [The use of a sequencer brings about all the problems we have seen when discussing the use of a coordinator for mutual exclusion. But distributed solutions (the ABCAST algorithm) are so complex and time consuming that they are avoided.]
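
The following Python sketch illustrates the idea; the class and method names are invented for the example, and the multicast itself is abstracted away:

  class Sequencer:
      # Central node that hands out consecutive sequence numbers.
      def __init__(self):
          self.next_seq = 0

      def get_sequence_number(self):
          self.next_seq += 1
          return self.next_seq

  class Member:
      # Receiver that delivers messages in sequence-number order.
      def __init__(self):
          self.delivered_up_to = 0   # highest sequence number delivered so far
          self.held = {}             # early messages, keyed by sequence number

      def on_receive(self, seq, payload):
          if seq <= self.delivered_up_to:
              return                 # duplicate: discard
          self.held[seq] = payload   # hold it (it may be deliverable right away)
          while self.delivered_up_to + 1 in self.held:
              self.delivered_up_to += 1
              print("delivered:", self.held.pop(self.delivered_up_to))

  def send_totally_ordered(sequencer, group, payload):
      # Getting the number and multicasting must behave as one indivisible
      # action if the total order is also to respect causal order.
      seq = sequencer.get_sequence_number()
      for member in group:           # stands in for a single multicast
          member.on_receive(seq, payload)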

An even stronger constraint on a communication system than (causal) total ordering is to require that it be totally ordered, causal, and that the sending of a multicast message by a node also count as the reception of that same message at that node (in other words, all the events associated with one message happen before, or all happen after, the events associated with another message). Such a system is said to be synchronous. In a synchronous system it is as if the sending and delivery of a message were instantaneous, and we can divide time at each machine into intervals demarcated by the sending/delivery of messages.

Replication

We use group communication for distribution of information to a group of subscribers (for example, brokers following changes on certain stocks; or a number of users interacting through a whiteboard where they simultaneously read and write; or a bulletin board; or a chat room; ..) and to achieve replication of information. In fact we can think of the former as a form of the latter. So we will focus on the latter.

Data and/or processing power may be replicated at a number of sites for a number of reasons:

  1. Improved Performance: Consider the use of caches. Whether as a memory in a computer system, or as a buffer in a disk system, or as local storage in a web browser, a cache is a memory that duplicates information by placing it closer to its points of use, thus enhancing speed of access.
  2. Improved Availability: Assume that important information (for example, the name service information) is replicated at n sites. Then, if p is the probability of failure at a site, the probability of having no service is p^n, and availability is 1 - p^n (see the numeric example after this list).
  3. Fault Tolerance: Here we stress, beyond availability, the ability to cope with partial failures or unrecognized failures, for example by using a voting scheme to deal with nodes that may be producing incorrect information.
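
For instance, with p = 0.01 and n = 3 replicas, the probability that all replicas are down at the same time is 0.01^3 = 10^-6, so the service is available with probability 0.999999.
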
A problem with duplicated information is inconsistency, the situation where copies may hold values different from each other or from the original site. A way to deal with this problem is to constrain how multicast update messages are delivered. For example, if we want the changes to appear in exactly the same order everywhere, then we should use synchronous ordering (this is what we want for a whiteboard where people can also talk, and thus exchange information, as changes are taking place). If we are stock brokers receiving stock information, we want totally ordered delivery so that all brokers see the changes in the same order (if there is a single feed, FIFO ordering will suffice). In most situations we will be satisfied with causal ordering.

ingargio@joda.cis.temple.edu