Fault Tolerance on the Cheap: How to Build Systems That Probably Won't Fall Over

Written by: Brian Troutwine

The property of fault tolerance is desirable in many systems, but there is little common literature on the subject. What does exist seems out of date, associated either with a defunct computer manufacturer, with proposals from early relational-database work, or with specific, esoteric functional programming languages.

Achieving fault tolerance, however, is not an esoteric matter. It is practical and can be approached iteratively: fault tolerance is a matter of degree, amenable to trade-offs necessitated by budgets or organizational needs.

This article is Part I in a two-part series that will discuss a high-level approach to building fault-tolerant systems. Here, we'll focus on background information.

What Is Fault Tolerance?

The late Jim Gray defined fault-tolerant systems as those in which "parts of the system may fail but the rest of the system must tolerate failures and continue delivering service." This is taken from Dr. Gray's article "Why Do Computers Stop and What Can Be Done About It?", a 1985 paper about Tandem Computers' NonStop, a computer system with fully redundant hardware.

Dr. Gray was concerned with the software running on these systems and the interaction with human operators, noting that these two categories made up the majority of observed failures. (That this undermined the business case for the Tandem NonStop was not mentioned in the article, but the trend of history toward deploying on cheap, fault-prone hardware and solving for this at the software level was inevitable.)

There are two things to unpack in Dr. Gray's definition:

  • "fail"

  • "continue delivering service"

Consider a software/hardware system as a total abstraction, with each subcomponent of the system defined along some boundary interface. A CPU is such a component, speaking over certain buses and obeying certain properties. An ORM model object is another, obeying a certain call protocol and behaving in a certain fashion. The exact semantics of components are often unspecified, but a large portion can be discovered through trial-and-error experimentation.
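As a rough sketch of what such a boundary interface might look like in code (the UserStore name and get_user method below are hypothetical, invented purely for illustration), the interface is the whole of what a caller gets to rely on:

    from typing import Protocol


    class UserStore(Protocol):
        """A hypothetical component boundary: callers depend only on this contract."""

        def get_user(self, user_id: int) -> dict:
            """Return the user's record; may raise if the backing store is unreachable."""
            ...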

A component that violates its discovered semantics has "failed." This could be a permanent violation -- maybe someone set your CPU on fire -- or it could be temporary -- perhaps someone shut down the database your model object abstracts, but they'll turn it back on again. But what truly matters is that a component of the system is behaving in a fashion not anticipated.

This is where "continue delivering service" comes in. Seen from the outside, the system (the conglomeration of components) itself possesses some interface with some discoverable behavior. The service produced by this system is that behavior. A fault-tolerant system is one in which the unanticipated actions of a subcomponent do not bubble out as unanticipated behavior from the system as a whole. Your database may go offline and your ORM object may fail, but the caller of that object copes and things go on as expected.
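Here is a minimal sketch of that idea in Python. Every name in it (UserProfiles, DatabaseUnavailable, fetch_display_name, the cache fallback) is invented for illustration rather than taken from any particular system; the point is only that the caller anticipates the ORM object's failure and degrades to cached data, so the fault never bubbles out of the system's own interface.

    import logging

    log = logging.getLogger(__name__)


    class DatabaseUnavailable(Exception):
        """Raised by the hypothetical ORM layer when its backing database is gone."""


    class UserProfiles:
        """A caller that treats the ORM object's failure as an anticipated case."""

        def __init__(self, orm, cache):
            self._orm = orm      # a model object backed by a database
            self._cache = cache  # e.g. a plain dict holding the last good answers

        def get_display_name(self, user_id):
            try:
                name = self._orm.fetch_display_name(user_id)
                self._cache[user_id] = name  # remember the last good answer
                return name
            except DatabaseUnavailable:
                log.warning("database down; serving cached name for user %s", user_id)
                # The subcomponent failed, but the service continues: serve
                # possibly-stale data instead of letting the fault propagate.
                return self._cache.get(user_id, "unknown user")

The trade-off is baked into the fallback: callers may occasionally see stale or placeholder data, but they never see the database outage itself.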

But how do you make such a thing?


Approaches to Coping

There are three broad approaches that organizations take to the production of fault-tolerant systems. Naturally, each has its trade-offs.

Option 1: Perfection

In this option, organizations reduce the probability of undiscovered behavior in system components to very low values. Consider the process described in "They Write the Right Stuff," a 1996 Fast Company article on the development of the Space Shuttle's flight software. Because of the critical nature of the flight computer in the shuttle's operation, unusual constraints were placed on the construction of the software. In particular:

  • Total control was held over the hardware. Everything was specified and custom made. Unknowns were intolerable, so no unknowns were accepted.

  • Total understanding of the problem domain was required. Every little bit of physics was worked out, and the cooperation of the astronauts with the computer system was scripted.

  • Specific, explicit goals were defined by the parent organization (in this case NASA). The shuttle had a particular job to do, and this was defined in detail.

  • The service lifetime of the system was defined in advance. That is, the running time of the computer was fixed, allowing formal methods that assume a fixed runtime to be employed.

This approach was not without its failures; the first orbiter flight was, in fact, delayed by a software bug. But each of the shuttle's catastrophes was attributable to mechanical faults, not software.

The downside of taking the perfection route is that it is extremely expensive and stifling. When an organization attempts perfection, it is explicitly stating that the final system is all that matters. Discovering some cool new trick the hardware can do in addition to its stated goals does not matter. The intellectual freedom of the engineers -- outside of the strictures of the All Important Process -- does not matter. All that matters is the end result.

Working in this fashion requires a boatload of money, a certain kind of engineer, sophisticated planning, and time. Systems produced like this are fault-tolerant because all possible faults are accounted for and handled, usually with a runbook and a staff of expert operators on standby.

Option 2: Hope for the best

On the flip side is the "hope for the best" model of system development. This is exemplified by a certain social media company's former motto, "Move Fast and Break Things."

Such a model requires little upfront understanding of the problem domain, coupled with very short-term goals, often "Just get something online." It is also often cheaper, in the short term, to pull off; the costs associated with upfront planning are entirely avoided, and fewer engineers are needed to produce something (anything).

Organizations taking this approach are implicitly stating that the future system and its behavior matter less than the research done to produce it, sometimes resulting in a Silicon-Valley-style "pivot."

The downside of this approach can be seen in the longer view. Ignorance of the problem domain will often result in long-term system issues, which may be resolvable but not without significant expense. Failures in a system produced in this way do propagate out to users, which may or may not be an issue depending on the system.

The most pernicious thing about this model is its cultural impact. That is, it's difficult to flip a switch in an organization and declare that, today, all software must be high quality. Ad hoc verification practices and poor operational management linger long after the organization declares them to be liabilities rather than virtues.

Sometimes, but not always, redundancy will be featured in a production "hope for the best" system, and the fail-over between components may or may not be tested. Such systems are often accidentally able to cope with failures and are kept online through ingenuity and coolness under fire.
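A minimal sketch of what that accidental fail-over tends to look like (the primary and secondary backends and their fetch method are invented for illustration): the caller tries one backend and quietly falls over to the other, a path that is only as trustworthy as the testing it has actually received.

    import logging

    log = logging.getLogger(__name__)


    def fetch_with_failover(primary, secondary, key):
        """Try the primary backend; fall over to the secondary on any error.

        If this path is never exercised before an outage, nothing guarantees
        the secondary actually answers when the primary finally goes away.
        """
        try:
            return primary.fetch(key)
        except Exception:
            log.warning("primary backend failed for %r; trying secondary", key)
            return secondary.fetch(key)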

Option 3: Embracing failure

Sitting in between these two options is the embrace of undiscovered faults as a first-class component of a system. Part Two of this series on fault tolerance will discuss the final option in-depth. Stay tuned!
