Archive-name: os-research/part3 Version: $Revision: 1.3 $ Posting-Frequency: monthly Last-Modified: Tue Aug 13 21:03:20 1996 URL: http://www.serpentine.com/~bos/os-faq/ Answers to frequently asked questions for comp.os.research: part 3 of 3 Copyright (C) 1994--1996 Bryan O'Sullivan TABLE OF CONTENTS 1. Distributed systems 1.1. What is the current status of the (insert name) project? 1.2. How do approaches to load balancing differ? 1.3. Fault tolerance in distributed systems 1.4. Naming in distributed systems 1.5. Distributed shared memory 1.5.1. Data consistency 1.5.1.1. Strictly consistent systems 1.5.1.2. Relaxing consistency 1.5.1.3. Application-specific coherence 1.5.2. Access synchronisation 1.5.3. Transfer and caching granularity 1.5.4. Address space structure 1.5.5. Fault tolerance 1.5.6. A brief bibliography on distributed shared memory 1.6. What have we learned? 2. Needful things ------------------------------ Subject: [1] Distributed systems From: Distributed systems A great deal of the high-profile research carried out in operating systems these days deals with distributed computing. Not surprisingly, discussions of distributed systems make up a large amount of the traffic on comp.os.research. ------------------------------ Subject: [1.1] What is the current status of the (insert name) project? From: Distributed systems See the section on `available software' for information on distributions of some of the systems mentioned here. - The Amoeba project is still going. There are roughly 20 people working on it, but most of these are no longer kernel hackers. They are working on using it for parallel programming, wide-area distributed systems, and other things. Amoeba is used in over 100 universities at the moment, and is also used at commercial institutions. - Brazil is the new research operating system being developed at AT&T Bell Labs. Research topics being addressed in Brazil center on higher-performance machines and, particularly, networks. A new in-house 300 megabit/s switched fiber network increases the potential bandwidth between machines by at least an order of magnitude; our aim is to realize and exploit that bandwidth. The overall design is to eliminate unnecessary overhead, particularly by restructuring and redesigning where necessary to avoid copying data from element to element along the communications path. Most of this software (except the operating system kernel) is written in a new concurrent systems programming language, Alef, which makes it easy to write multi-process servers and applications that can communicate using messages or shared memory, as appropriate. A paper on Alef is available from the Plan 9 ftp site; see part 2 of this FAQ for a pointer. - Cronus is still under development at BBN. The current public release is 3.0. The project currently has two thrusts---as the base for advanced distributed system R&D, and as a platform for constructing and deploying sophisticated distributed applications. Ongoing research topics include the integration of Cronus and Mach technology, the exploration of techniques for the construction of WAN-based and multi-organisational applications, investigation into the integration of distributed systems and network management systems, and work in high-performance distributed computing. - Horus is being developed by the same group that worked on Isis; the head of this group is Robbert van Renesse. - Isis is no longer being developed at Cornell; it is now managed as a commercial product. - Mach is no longer being developed at CMU. Current work on Mach is being carried out by the OSF Research Institute and at the University of Utah. - Plan 9 is no longer in development at AT&T Bell Labs. fibre-optic network. The operating systems research group at Bell Labs has moved on to a new project, called Brazil, which addresses portable computing and distributed applications programming. - QNX is a commercial POSIX-certified realtime OS with an installed base of over 250,000 systems. It is used extensively in process control, factory automation, medical instrumentation, communications and point-of-sale. A number of universities are also doing research with QNX. - The Sprite network operating system project has ended. ------------------------------ Subject: [1.2] How do approaches to load balancing differ? From: Distributed systems Load-balancing policy falls into two broad groups: static and dynamic. Static policies use algorithms which operate without regard to run-time loads across a system, while dynamic policies use the run-time performance of various parts of a system in order to make more `informed' decisions about balancing. [92-11-06-12-53.57] A dynamic load-balancing policy is one which uses run-time state information in making scheduling decisions. There are two kinds of dynamic policies: adaptive and non-adaptive. The latter always use the same (fixed, load-dependent) policy; the former may adjust policy parameters in order to gradually improve their performance. The key point is that while non-adaptive policies use only the information about the run-time state, adaptive policies use, in addition to this, information about current performance. In adaptive policies, the rules for adjusting policy parameters may be static or dynamic. An example of the former might be: `shift to a conservative migration rule when system-wide load patterns are varying too rapidly'. An example of the latter could be: `increase sender-side threshold when migrated jobs cause slowdown rather than speedup'. Some researchers refer to the performance-driven adaptation exhibited by the second policy as `learning'. Since both non-adaptive policies and adaptive policies with static rules really use only load information, it is confusing to distinguish between them. One way to avoid such confusion is to restrict the use of the word `adaptive' to policies that use performance feedback in order to drive their adjustment of policy parameters. ------------------------------ Subject: [1.3] Fault tolerance in distributed systems From: Distributed systems One approach to providing fault tolerance in distributed systems involves the use of redundant services, such that standby facilities can become active in the event of the failure of, or loss of connection to, a primary service. Another approach is to provide multiple paths of connectivity between the computers that make up the distributed system. The QNX system, for example, supports multiple network drivers per node. The purpose of the network connection under QNX is to merge the microkernels on the LAN into a single logical kernel. Hence, if multiple LAN connections per node are present, the networking code can load balance the LAN traffic on the paths available. It can also route around failed links, providing both greater LAN bandwidth and better fault tolerance. See below for treatment of fault tolerance in systems which make use of distributed shared memory. ------------------------------ Subject: [1.4] Naming in distributed systems From: Distributed systems [Material on naming and/or global naming sought.] ------------------------------ Subject: [1.5] Distributed shared memory From: Distributed systems Distributed computer systems have evolved using message passing as their main method of communication. Other communication systems used in loosely coupled distributed systems, such as RPC, are usually implemented on top of an underlying message passing system. On the other hand, in tightly coupled systems, such as a multi-processor machine, the communication method used is usually shared memory. In distributed shared memory (DSM) systems [Nitzberg & Lo, 91], processes share data transparently across node boundaries; data faulting, location, and movement is handled by the underlying system. Among other things, this allows parallel programs designed to use shared memory to execute transparently on a loosely coupled distributed system. While the performance implications cannot be ignored, the advantages of the shared memory programming model are well known: - Shared memory programs are usually shorter and easier to understand than equivalent message passing programs. - Large or complex data structures may easily be communicated. - Shared memory gives transparent process-to-process communication. - Programming with shared memory is a well-understood problem. Shared-memory (or `procedure-oriented') and message-oriented operating systems are, in some sense, equivalent [Lauer & Needham, 78], though it has been claimed that the former are `more powerful' [Tam et al., 90]. ------------------------------ Subject: [1.5.1] Data consistency From: Distributed systems Despite recent advances in both local and wide-area networking technologies, network latency is still a major factor in distributed systems and likely to remain so. All DSM systems provide some sort of caching in an attempt to improve the performance beyond that provided by doing a network access on every reference to a non-local data item. Each system must decide whether or not to attempt to keep the data coherent, and, if so, what coherence strategy to use. The coherence semantics which may be provided to the programmer include: - `strict' consistency, where a read always returns the value written by the most recent write - a `loosely' consistent system where the system enforces some form of weak consistency guarantees and the application (or compiler or user) can indicate synchronisation points where consistency must be enforced; - no automatic consistency mechanism, but provide the user with the facilities necessary to implement user level synchronisation and consistency. ------------------------------ Subject: [1.5.1.1] Strictly consistent systems From: Distributed systems Older, strictly consistent systems tend to enforce a single writer, multiple reader model, where at any time data will be held either at a single node (which may have write access) or several nodes (none of which may have write access). Given this model, we must be able to locate a copy of our data when it is not resident. The method most frequently used is to assign an `owner' to each item of data, where the owner has either the only writeable copy of the data, or one of the read-only copies. Ownership may remain fixed throughout the life of a datum, or it may change dynamically. In the latter case, the problem arises of locating the owner. A database of locations may be maintained by centralised managers, or ownership information can be distributed among nodes of the system [Li and Hudak, 89]. In a strictly consistent system, we must also be able to synchronise writes. The two major solutions to this problem are: - Write broadcast. The effects of every write are broadcast to ever node that has a copy of the data being written; this effectively implements a replication algorithm. Write broadcast is usually considered too expensive to be used as a general solution. - Write invalidation. Each node in the system holding a read-only copy of the data being written is sent an invalidation message. ------------------------------ Subject: [1.5.1.2] Relaxing consistency From: Distributed systems Permitting temporary inconsistencies is a common method of increasing performance in distributed systems. Memory is said to be loosely coherent if the value returned by a read operation is the value written by an update operation to the same object that `could' have immediately preceded the read operation in some legal schedule of the threads in execution [Bennett et al., 90]. Using loose coherence, more than one thread may have write access to the same object, provided that the programmer knows that the writes will not conflict. Another memory consistency model is `release consistency' [Gharachorloo et al., 90], in which memory accesses are divided into ordinary and synchronisation-related accesses. The latter are further divided into `acquire' and `release' operations. The `acquire' operation indicates that shared data is needed, and a processor's updates are not guaranteed to be performed at other nodes until a `release' is performed. The primary advantage of this form of consistency is that it allows consistency updates to be tied to synchronisation events, and therefore to be delayed until actually needed by applications. However, most release consistent systems require the programmer to make explicit use of `acquire' and `release' operations. A DSM system called Midway introduces another new consistency model, `entry consistency' [Bershad et al., 93]. Entry consistency is weaker than many of the other models suggested, including release consistency; it requires explicit annotations to associate synchronisation objects and data. On an `acquire', only the data associated with the synchronisation object is guaranteed to be consistent. This extra weakness permits higher performance implementations of the underlying consistency protocols to be written. Midway also supports stronger consistency models, so that the application programmer can trade-off performance against the extra effort required to write entry consistent programs. ------------------------------ Subject: [1.5.1.3] Application-specific coherence From: Distributed systems >From [Cheriton, 86]: `Problem-oriented shared memory' is a shared memory that implements fetch and store operations specialised to the particular problem or application it is supporting. In particular, a problem-oriented shared memory commonly provides a specialised form of consistency and consistency maintenance that exploits application-specific semantics. Cheriton goes on to propose that consistency constraints be relaxed and more use be made of problem semantics. He suggests that, in some cases, stale data may be detected on use by the client, and the client may then recover. A example would be hint caching. In some applications, stale data may actually be sufficiently accurate, provided that the client can obtain up to date information when necessary. In other applications, some data may be optional in the sense that the client can continue without it. Other applications may tolerate having the results of store operations being lost or undone, for example, an application that regularly updates the entire data set. Another approach is presented by the designers of Munin, where the runtime system accepts hints from the compiler or user to determine the coherence mechanism to be used for each object. The default, in the absence of hints, is to use a general read-write consistency mechanism, much like that employed by IVY. Munin supports several different object types that are based on the results of a survey of shared memory access characteristics. The results of the survey showed that a very small percentage of all accesses to shared data fall under the general read-write type. The Munin designers also note that a program moves through various stages of execution, and the types associated with objects change as time progresses ------------------------------ Subject: [1.5.2] Access synchronisation From: Distributed systems Most parallel applications will use some sort of synchronisation system to order and control accesses to shared data before actually accessing the data. The most important thing to note in DSM systems is that just blindly using standard test and set operations on bytes in shared pages will produce a high fault rate; faults are usually expensive, making this approach unacceptable. Clouds merges locking with the cache consistency protocol, so that the user may obtain both a lock and the data in one network transaction. This system has the advantage that no invalidation messages are required, since the granting of the lock guarantees that there are no conflicting copies; it has the disadvantage that an explicit unlock/discard operation is required to release access to the data. This is acceptable in Clouds, as the DSM system was designed specifically to support object invocation, so it is easy to discard on a return. Munin provides a distributed lock mechanism using `proxy objects' to reduce network load. Proxy objects are maintained by a lock server on each node; when a thread wants to obtain a lock on an object, it attempts to lock the proxy instead. The server obtains the global lock if it is not already held locally. Global locking is done by negotiating with all the other lock servers in the system. Each lock may be migrated from server to server, and part of the Munin system allows objects to be migrated along with their locks. Other systems, such as IVY and Mermaid, use modified versions of classic multiprocessor synchronisation facilities. ------------------------------ Subject: [1.5.3] Transfer and caching granularity From: Distributed systems When caching objects in local memory, it is necessary to decide what level of granularity to use. All current systems use a fixed block size in the cache, rather than varying the granularity based on object size. Usually this is due to constraints imposed by the system hardware and memory management. The choice of the block size in the cache depends on several issues. - Cost of communication: for example, on many local area networks there is little difference between the time required to send a one-byte message and that required to send a 1024-byte message. Transmitting bulk changes rather than single-byte modifications would therefore seem desirable. - The choice of granularity also depends on the locality of reference in the application, as thrashing may occur when two machines are both accessing the same block (this is also known as the `ping-pong effect'). This would seem to argue for a smaller block size. It should be noted that many object-oriented systems exhibit very poor locality of reference. In practice, a compromise must be achieved, as with conventional virtual memory systems. Most systems use a block size which is the same as that of the virtual memory management unit on the system, or a multiple thereof. Among other things, it allows the hardware to be used to help in the maintenance of consistency. The choice is complicated somewhat when heterogeneous machines are being used, but in these cases, the lowest common multiple of hardware supported page sizes can usually be used. The only major system that doesn't use a large block size is Memnet, in which a hardware based DSM system was implemented on a high speed token ring; a 32-byte block size was used instead [Delp & Farber]. The choice of a small block size is appropriate, as the system is much closer to a shared memory multi-processor than it is to a software DSM system. This is because the entire processor is blocked on a cache miss; the processor is not actually aware of the distributed nature of its address space. Also, the ratio between remote and local memory access times is much lower than in the software based systems due to the dedicated token ring (200Mbps) and hardware assistance. ------------------------------ Subject: [1.5.4] Address space structure From: Distributed systems In a single shared address space system, the system appears as a set of threads executing in a shared distributed address space. Objects always appear at the same addresses on all nodes. Single address space systems have had a resurgence in popularity with the arrival of 64-bit processors. A number of researchers believe that a 64-bit address space is large enough to act as a single global address space for all the memory (both primary and secondary) in a distributed system. Examples of such systems include Angel, Mungi, and Opal. Security and protection are a major problem in such systems, and current approaches either rely on hardware assistance or stochastic algorithms, or ignore the problem. Another approach is to divide each process's address space into different fixed regions, some of which are private and not shared, and some of which are shared with some other processes. Ra, the Clouds kernel, takes this approach using O, P, and K address regions, with the O region shared between all processes executing in a given object; the P and K regions are local to a process and kernel, respectively. Here objects always appear at the same address but may not be visible from every address space. By contrast, some systems, including Mirage and Mach, allow shared data to exist at differing addresses in different processes address spaces. However, neither system does transparent pointer translation, so the address changes are not entirely transparent to the application. As for the structuring of the shared region itself, some systems -- for example, IVY and Mether -- use a single flat region: one continuous range of virtual addresses represent the shared address space and are managed by the DSM system. This single address space is usually sub-divided into pages. Most systems use paged segmentation: the shared region consists of disjoint pieces, which are usually managed separately and are not all mapped in any one process. Frequently, the segments (sometimes called memory objects, or windows) are related to the backing store. For example, in Clouds, the object address space consists of windows onto larger segments; these segments are usually maintained on secondary storage. ------------------------------ Subject: [1.5.5] Fault tolerance From: Distributed systems Most DSM systems ignore the fault tolerance issue or maintain that it is an operating system issue and should be handled by the underlying system. However, it would appear that in practice a DSM system would strongly effect the fault tolerance of a system. For example, in a system where several systems are sharing access to a set of data, the failure of any one of them could lead to the failure of all the connected sites (or, at least, some of the processes on each site). We are also presented with an unusual failure handling problem. It is fairly easy to see how to handle a failed message or RPC, but how do you handle a failed page fault? The original Clouds system provided recoverability using shadowing of segments and a transactional system using commits. The recovery system was not really integrated with the DSM system and was merely implemented at the segment storage site. In order to maintain a consistent view of data when one transaction is active at multiple nodes, they have more recently been forced to integrate the transaction system with the DSM support system. ------------------------------ Subject: [1.5.6] A brief bibliography on distributed shared memory From: Distributed systems [Nitzberg & Lo, 1991] Nitzberg, W. and Lo, V., `Distributed shared memory: a survey of issues and algorithms', IEEE Computer, August 91, pp. 52-60 [Lauer & Needham, 1978] [Tam et al., 90] Tam, M.-C., Smith, J. M. & Farber, D. J., `A taxonomy-based comparison of several distributed shared memory systems', ACM Operating Systems Review 24(3), July 90, pp. 40-67 [Li and Hudak, 89] Li, K. & Hudak, P., `Memory coherence in shared virtual memory systems', ACM Transactions on Computer Systems 7(4), November 89, pp. 321-359 [Bennett et al., 90] Bennett, J. K., Carter, J. B. & Zwaenopoel, W., `Munin: distributed shared memory based on type-specific memory coherence', Proceedings of the 2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, SIGPLAN Notices 25(3), March 90, pp. 168-176 [Gharachorloo et al., 90] Gharachorloo, K., et al., `Memory consistency and event ordering in scalable shared-memory multiprocessors', ACM SIGARCH News 18(2), June 90 [Bershad et al., 93] Bershad, B. N., et al., `The Midway distributed shared memory system', Technical Report CMU-CS-93-119, School of Computer Science, Carnegie Mellon University, 1993. Available via anonymous ftp from . [Cheriton, 86] Cheriton, D. R., `Problem-oriented shared memory: a decentralized approach to distributed system design', Proceedings of the 6th International Conference on Distributed Computing Systems, May 86, pp. 190-197 [Delp & Farber] Delp, G. S. & Farber, D. J., `Memnet -- a different approach to a network', Technical Report, Department of Electrical Engineering, University of Delaware, ??? ------------------------------ Subject: [1.6] What have we learned? From: Distributed systems Andy Tanenbaum started a (very long) thread on this topic in comp.os.research in April of 1992 [92-04-03-17-10.05]. The interested reader is directed to the comp.os.research archives, since this thread proved rather divisive (i.e. nobody really agreed on any issue). ------------------------------ Subject: [2] Needful things From: Needful things This FAQ is incomplete, and will probably remain in this state to a greater or lesser extent for ever and ever. Should you feel willing to contribute some material, the following is a list of topics which ``urgently'' require treatment (some of which I may get around to covering myself at some point): - naming in distributed systems ================== RFC 822 Headers ==================