|
Abstract:
|
This dissertation presents techniques for detecting and tolerating faults in distributed systems . Detecting faults in distributed or parallel systems is often very difficult . We look at the problem of determining if a property or assertion was true in the computation . We formally define a logic called BTL that can be used to define such properties . Our logic takes temporal properties in consideration as these are often necessary for expressing conditions like safety violations and deadlocks . We introduce the idea of a basis of a computation with respect to a property . A basis is a compact and exact representation of the states of the computation where the property was true . We exploit the lattice structure of the computation and the structure of different types of properties and avoid brute force approaches . We have shown that it is possible to efficiently detect all properties that can be expressed by using nested negations , disjunctions , conjunctions and the temporal operators possibly and always . Our algorithm is polynomial in the number of processes and events in the system , though it is exponential in the size of the property . After faults are detected , it is necessary to act on them and , whenever possible , continue operation with minimal impact . This dissertation also deals with designing systems that can recover from faults . We look at techniques for tolerating faults in data and the state of the program . Particularly , we look at the problem where multiple servers have different data and program state and all of these need to be backed up to tolerate failures . Most current approaches to this problem involve some sort of replication . Other approaches based on erasure coding have high computational and communication overheads . We introduce the idea of fusible data structures to back up data . This approach relies on the inherent structure of the data to determine techniques for combining multiple such structures on different servers into a single backup data structure . We show that most commonly used data structures like arrays , lists , stacks , queues , and so on are fusible and present algorithms for this . This approach requires less space than replication without increasing the time complexities for any updates . In case of failures , data from the back up and other non -failed servers is required to recover . To maintain program state in case of failures , we assume that programs can be represented by deterministic finite state machines . Though this approach may not yet be practical for large programs it is very useful for small concurrent programs like sensor networks or finite state machines in hardware designs . We present the theory of fusion of state machines . Given a set of such machines , we present a polynomial time algorithm to compute another set of machines which can tolerate the required number of faults in the system . |