Cost Reduction in High Power Computing Using a Deferred Repair Strategy: A Simulation Study

Koçyiğit, Altan
Gemikonakli, Orhan
Ever, Enver
Fault-tolerant systems that use a repair-upon-failure strategy can become expensive in terms of labour and time. For homogeneous multi-server systems in particular, if no control hierarchy exists, postponing non-essential repairs can reduce these costs without significantly affecting the availability of the whole system. While repairs are postponed, the system must remain capable of handling user requests. For this purpose, a threshold value is usually defined, representing the minimum number of servers the system administrator should keep operative. Because such systems keep operating in the presence of failures, their performance and availability need to be evaluated together, which makes performability evaluation very important. In this paper, the simulation of large-scale multi-server systems with identical servers serving a stream of arriving jobs is considered. The cost of running such systems under various deferred repair strategies is calculated and compared with the cost of using a repair-upon-failure strategy.
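
The threshold idea can be illustrated with a minimal simulation sketch. It is not the authors' model: it ignores the job stream and queueing entirely, and it assumes exponentially distributed failure times, a single repair visit that restores every server at once, and a visit that is triggered only when the number of operative servers falls to the threshold. The function name `simulate` and all parameter names and values are hypothetical, chosen for illustration only.

```python
import random


def simulate(n_servers, threshold, fail_rate, visit_rate, horizon, seed=0):
    """Return (repair_visits, mean_operative_servers) over the time horizon.

    Assumptions (not from the paper): each operative server fails at rate
    fail_rate; a repair visit lasts Exp(visit_rate) and restores all
    servers; a visit starts when operative servers <= threshold.
    """
    if not 0 <= threshold < n_servers:
        raise ValueError("threshold must satisfy 0 <= threshold < n_servers")
    rng = random.Random(seed)
    t = 0.0
    up = n_servers          # number of operative servers
    repairing = False       # True while a repair visit is in progress
    visits = 0
    area = 0.0              # time integral of the operative-server count

    while t < horizon:
        failure_rate_total = up * fail_rate
        total_rate = failure_rate_total + (visit_rate if repairing else 0.0)
        dt = min(rng.expovariate(total_rate), horizon - t)
        area += up * dt
        t += dt
        if t >= horizon:
            break
        if rng.random() * total_rate < failure_rate_total:
            # A server fails; call the repair crew if the threshold is reached.
            up -= 1
            if up <= threshold and not repairing:
                repairing = True
                visits += 1
        else:
            # The repair visit completes and restores every server.
            up = n_servers
            repairing = False

    return visits, area / horizon


# Repair-upon-failure corresponds to threshold = n_servers - 1.
immediate = simulate(100, 99, fail_rate=0.001, visit_rate=0.5, horizon=10_000)
deferred = simulate(100, 90, fail_rate=0.001, visit_rate=0.5, horizon=10_000)
print("repair-upon-failure:", immediate)
print("deferred (threshold 90):", deferred)
```

Under these assumptions, a lower threshold trades a smaller average number of operative servers for fewer repair visits. Multiplying the visit count by a per-visit cost, and adding a penalty for lost capacity, would give one possible cost measure for comparing strategies.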