Use of Formal Methods at Amazon Web Services

To safeguard that data, the core of each service relies on fault-tolerant distributed algorithms for replication, consistency, concurrency control, auto-scaling, load balancing, and other coordination tasks.

We use deep design reviews, code reviews, static code analysis, stress testing, fault-injection testing, and many other techniques, but we still find that subtle bugs can hide in complex concurrent fault-tolerant systems. One reason for this problem is that human intuition is poor at estimating the true probability of supposedly ’extremely rare’ combinations of events in systems operating at a scale of millions of requests per second.

Precise Designs

As our designs are unavoidably complex, we needed a highly expressive language, far above the level of code, but with precise semantics. That expressivity must cover real-world concurrency and fault-tolerance. And, as we wish to build services quickly, we wanted a language that is simple to learn and apply, avoiding esoteric concepts. We also very much wanted an existing ecosystem of tools. In summary, we were looking for an off-the-shelf method with high return on investment.

TLA+ is based on simple discrete math, i.e. basic set theory and predicates. A TLA+ specification describes the set of all possible legal behaviors (execution traces) of a system. In TLA+, correctness properties and system designs are just steps on a ladder of abstraction, with correctness properties occupying higher levels, systems designs and algorithms in the middle, and executable code and hardware at the lower levels.

a system design correctly implements the desired correctness properties, either via

  • conventional mathematical reasoning
  • more easily and quickly by using tools such as the TLC model checker, a tool which takes a TLA+ specification and exhaustively checks the desired correctness properties across all of the possible execution traces.

The value of Formal Methods for ‘Real-world Systems’

Side Benefit: A Better Way to Design Systems

In contrast, when using formal specification we begin by precisely stating “what needs to go right?”

First we state what the system should do, by defining correctness properties. These come in two varieties:

  • Safety properties: “what the system is allowed to do”
  • Liveness properties: “what the system must eventually do”

More Side Benefits: Improved Understanding, Productivity and Innovation

In addition, a precise, testable, well commented description of a design is an excellent form of documentation.

A formal specification is precise, short, and can be explored and experimented upon with tools.

What Formal Specification Is Not Good For

Two major classes of problems with large distributed systems:

  1. bugs and operator errors that cause a departure from the logical intent of the system
  2. surprising ‘sustained emergent performance degradation’ of complex systems that inevitably contain feedback loops

TLA+ could be used to specify an upper bound on response time, as a real-time safety property. However, out systems are built on infrastructure (disks, operating systems, network) that do not support hard real-time scheduling or guarantees, so real-time safety properties would not be realistic. We build soft real-time systems in which very short periods of slow responses are not considered errors. However, prolonged severe slowdowns are considered errors. We don’t yet know of a feasible way to model a real system that would enable tools to predict such emergent behavior. We use other techniques to mitigate those risks.

First Steps To Formal Methods

First Big Success At Amazon

Therefore, using TLA+ in place of traditional proof writing would likely have improved time-to-market in addition to achieving higher confidence on system correctness.

Persuading More Engineers Leads to Further Successes

TLA+: Exhaustively testable pseudo-code

TLA+:

  • an excellent tool for data modeling, e.g. designing the schema for a relational or ‘No SQL’ database.
  • design a non-trivial schema, with semantic invariants over the data that were much richer than standard multiplicity constraints and foreign key constraints. That suggests that a data model might best be viewed as just another level of abstraction of the entire system.

The Most Frequently Asked Question

Alternatives to TLA+

Conclusion

Formal methods deal with models of systems, not the systems themselves, so the adage applies; “All models are wrong, some are useful.”