Multiprocessor computer systems are becoming increasingly important as vehicles for solving computationally expensive problems. Synchronization among the processors is achieved with a variety of clock configurations. A new notion of fault-tolerance for clock synchronization algorithms is defined, tailored to the requirements and failure patterns of shared memory multiprocessors. Algorithms in this class can tolerate any number of napping processors, where a napping processor can fail by repeatedly ceasing operation for an arbitrary time interval and then resume operation without necessarily recognizing that a fault has occurred. These algorithms guarantee that, for some fixed k, once a processor P has been working correctly for at least k time, then as long as P continues to work correctly, (1) P does not adjust its clock, and (2) P's clock agrees with the clock of every other processor that has also been working correctly for at least k time. Because a working processor must synchronize in a fixed amount of time regardless of the actions of the other processors, these algorithms are called wait-free. Another useful type of fault-tolerance is called self-stabilization: starting with an arbitrary state of the system, a self-stabilizing algorithm eventually reaches a point after which it correctly performs its task. Two wait-free clock synchronization algorithms are presented for a model with global clock pulses. The first one is self-stabilizing; the second one is not but it converges more quickly than the first one. The self-stabilizing algorithm requires each processor's communication register contents to be a part of the processor's state. This last requirement is proven necessary. A wait-free clock synchronization algorithm is also presented for a model with local clock pulses. This algorithm is not self-stabilizing.
- Clock synchronization
- Distributed computing