Directions

Applicability depends on overhead: any fault tolerance approach that consumes a significant share of the computing resource is (at best) unattractive.

Applicability depends on design reliability: complex designs are often unreliable, especially when they are seldom exercised (as is the case for fault tolerance mechanisms).

Alvisi et al. carried out experiments showing that rollback recovery, when based on a popular scheme for dynamic recording of checkpoints, has a drastic impact on system performance.

The adoption of simpler static schemes is therefore justified, although they do not optimize the amount of lost computation in case of failure.

We want to devise a simpler scheme for managing checkpoints that has only a moderate impact on system operation:

exploiting the unifying concept may simplify the design;
selecting an appropriate communication model may reduce overhead.


International Symposium on Computers and Communications