Leveraging Formal and Semi-formal Methods
Marc Brooker and Ankush Desai
AWS (Amazon Web Services) strives to deliver reliable services that customers can trust completely. This demands maintaining the highest standards of security, durability, integrity, and availability—with systems correctness serving as the cornerstone for achieving these priorities. An April 2015 paper published in Communications of the ACM, titled “How Amazon Web Services Uses Formal Methods,” highlighted the approach for ensuring the correctness of critical services that have since become among the most widely used by AWS customers.21
Central to this approach was TLA+,14 a formal specification language developed by Leslie Lamport. Our experience at AWS with TLA+ revealed two significant advantages of applying formal methods in practice. First, we could identify and eliminate subtle bugs early in development—bugs that would have eluded traditional approaches like testing. Second, we gained the deep understanding and confidence needed to implement aggressive performance optimizations while maintaining systems correctness.
Fifteen years ago, AWS's software testing practice relied primarily on build-time unit testing, often against mocks, and limited deployment-time integration testing. Since then, we have significantly evolved our correctness practices, integrating both formal and semi-formal approaches into the development process. As AWS has grown, formal methods have become increasingly valuable, not only for ensuring correctness but also for performance improvements, particularly in verifying the correctness of both low- and high-level optimizations. This systematic approach to systems correctness has become a force multiplier at AWS's scale, enabling faster development cycles through improved developer velocity while delivering more cost-effective services to customers.
This article surveys the portfolio of formal methods used across AWS to deliver complex services with high confidence in their correctness. We consider an umbrella definition of formal methods that encompasses a range of rigorous techniques, from traditional formal approaches such as theorem proving,7,10 deductive verification,18 and model checking8,14 to more lightweight semi-formal approaches such as property-based testing,6,19 fuzzing,9 and runtime monitoring.11
The P Programming Language
As the use of formal methods expanded beyond the initial teams at AWS in the early 2010s, we discovered that many engineers struggled to learn and become productive with TLA+. This difficulty seemed to stem from TLA+'s defining feature: It is a high-level, abstract language that more closely resembles mathematics than the imperative programming languages most developers are familiar with. While this mathematical nature is a significant strength of TLA+, and we continue to agree with Lamport's views on the benefits of mathematical thinking,15 we also sought a language that would allow us to model check (and later prove) key aspects of systems designs while being more approachable to programmers.
We found this balance in the P programming language.8 P is a state-machine-based language for modeling and analysis of distributed systems. Using P, developers model their system designs as communicating state machines, a mental model familiar to Amazon’s developer population, most of whom develop systems based on microservices and SOAs (service-oriented architectures). P has been developed at AWS since 2019 and is maintained as a strategic open-source project.22 Teams across AWS that build some of its flagship products—from storage (e.g., Amazon S3, EBS), to databases (e.g., Amazon DynamoDB, MemoryDB, Aurora), to compute (e.g., EC2, IoT)—have been using P to reason about the correctness of their system designs.
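As a rough illustration of that mental model, the following Rust sketch (illustrative only; it is not P syntax, and the Ping/Pong protocol and machine names are invented for this example) drives two communicating state machines, a client and a server, from an explicit message queue and checks a simple safety property:

```rust
use std::collections::VecDeque;

#[derive(Debug)]
enum Msg {
    Ping, // client -> server
    Pong, // server -> client
}

#[derive(Debug, PartialEq)]
enum ClientState {
    Init,
    WaitPong,
    Done,
}

fn main() {
    let mut client = ClientState::Init;
    // One queue of (destination, message) pairs keeps the sketch small;
    // P gives every machine its own input buffer.
    let mut network: VecDeque<(&str, Msg)> = VecDeque::new();

    // Client's entry action in its Init state: send Ping and wait for the reply.
    if client == ClientState::Init {
        network.push_back(("server", Msg::Ping));
        client = ClientState::WaitPong;
    }

    // Drive the system until no messages are in flight.
    while let Some((dst, msg)) = network.pop_front() {
        match (dst, msg) {
            // Server machine: answers every Ping with a Pong.
            ("server", Msg::Ping) => network.push_back(("client", Msg::Pong)),
            // Client machine: transition WaitPong -> Done on Pong.
            ("client", Msg::Pong) => {
                assert_eq!(client, ClientState::WaitPong, "safety: Pong only arrives while waiting");
                client = ClientState::Done;
            }
            (dst, msg) => panic!("unexpected message {:?} for {}", msg, dst),
        }
    }

    // Check at quiescence: the client completed the handshake.
    assert_eq!(client, ClientState::Done);
}
```

A P model expresses the same structure directly, with machines, events, and specifications as first-class constructs, and its model checker, rather than a hand-written driver, explores the possible message interleavings and failure scenarios.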
For example, P was used in migrating S3 (Simple Storage Service) from eventual to strong read-after-write consistency.1 A key component of S3 is its index subsystem, an object metadata store that enables fast data lookups. To achieve strong consistency, the S3 team had to make several nontrivial changes to the S3 index protocol stack.25 Because these changes were difficult to get right at S3 scale, and the team wanted to deliver strong consistency with high confidence in correctness, they used P to formally model and validate the protocol design. P helped eliminate several design-level bugs early in the development process and allowed the team to deliver risky optimizations with confidence, as they could be validated using model checking.
In 2023, the P team at AWS built PObserve, a new tool for validating the correctness of distributed systems both during testing and in production. With PObserve, we take structured logs from the execution of distributed systems and validate post hoc that they match behaviors allowed by the formal P specification of the system. This bridges the gap between the P specification of the system design and the production implementation (typically in languages like Rust or Java). While there are significant benefits from verifying protocols at design time, runtime monitoring of the same properties in the implementation makes the investment in formal specification much more valuable and addresses a classic concern with deploying formal methods in practice: connecting design-time validation with the system implementation.
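The core idea behind this kind of runtime monitoring can be sketched in a few lines of Rust. The event format and the mutual-exclusion property below are hypothetical stand-ins, not PObserve's actual log schema or API; the point is that production logs are replayed through a small reference state machine derived from the specification:

```rust
// Replay structured log events through a reference state machine and flag any
// event the specification does not allow. Event shape and property are illustrative.

#[derive(Debug)]
enum Event {
    Acquire { client: u32 },
    Release { client: u32 },
}

#[derive(Default)]
struct LockMonitor {
    holder: Option<u32>, // specification state: who holds the lock, if anyone
}

impl LockMonitor {
    // Returns Err if the observed event violates the specification.
    fn observe(&mut self, e: &Event) -> Result<(), String> {
        match e {
            Event::Acquire { client } => match self.holder {
                None => {
                    self.holder = Some(*client);
                    Ok(())
                }
                Some(h) => Err(format!("client {client} acquired while {h} holds the lock")),
            },
            Event::Release { client } => {
                if self.holder == Some(*client) {
                    self.holder = None;
                    Ok(())
                } else {
                    Err(format!("client {client} released a lock it does not hold"))
                }
            }
        }
    }
}

fn main() {
    // In practice these events would be parsed from production logs.
    let log = vec![
        Event::Acquire { client: 1 },
        Event::Release { client: 1 },
        Event::Acquire { client: 2 },
        Event::Acquire { client: 3 }, // violation: client 2 still holds the lock
    ];

    let mut monitor = LockMonitor::default();
    for event in &log {
        if let Err(violation) = monitor.observe(event) {
            eprintln!("specification violated: {violation}");
        }
    }
}
```

In the PObserve setting, the monitored properties come from the same P specification that was model checked at design time, which is what ties design-time validation to the running implementation.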
Lightweight Formal Methods
Another way that AWS has brought formal methods closer to its engineering teams is through the adoption of lightweight formal methods.
Property-based testing
The most notable example of leveraging lightweight formal methods is Amazon S3's ShardStore, where the team used property-based testing throughout the development cycle both to test correctness and to speed up development (described in detail by Bornholt et al.4). The key idea in their approach was combining property-based testing with developer-provided correctness specifications, coverage-guided fuzzing (an approach where the distribution of inputs is guided by code coverage metrics), failure injection (where hardware and other system failures are simulated during testing), and minimization (where counterexamples are automatically reduced to aid human-guided debugging).
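A minimal sketch of this reference-model style of property-based test is shown below, using the proptest crate in Rust. The toy log-structured store, its operation type, and the HashMap used as the correctness specification are assumptions made for this example (this is not the ShardStore harness), and the coverage-guided fuzzing and failure-injection machinery is not shown:

```rust
use proptest::prelude::*;
use std::collections::HashMap;

#[derive(Debug, Clone)]
enum Op {
    Put(u8, u8),
    Get(u8),
}

// Toy store under test: last write wins over an append-only log.
struct LogStore {
    log: Vec<(u8, u8)>,
}

impl LogStore {
    fn new() -> Self {
        LogStore { log: Vec::new() }
    }
    fn put(&mut self, k: u8, v: u8) {
        self.log.push((k, v));
    }
    fn get(&self, k: u8) -> Option<u8> {
        self.log.iter().rev().find(|(key, _)| *key == k).map(|(_, v)| *v)
    }
}

// Strategy that generates random operations against the store.
fn op_strategy() -> impl Strategy<Value = Op> {
    prop_oneof![
        (any::<u8>(), any::<u8>()).prop_map(|(k, v)| Op::Put(k, v)),
        any::<u8>().prop_map(Op::Get),
    ]
}

proptest! {
    #[test]
    fn log_store_matches_hashmap_model(ops in proptest::collection::vec(op_strategy(), 0..64)) {
        let mut store = LogStore::new();
        let mut model: HashMap<u8, u8> = HashMap::new();
        for op in ops {
            match op {
                Op::Put(k, v) => {
                    store.put(k, v);
                    model.insert(k, v);
                }
                // The developer-provided specification: reads must agree with the model.
                Op::Get(k) => prop_assert_eq!(store.get(k), model.get(&k).copied()),
            }
        }
    }
}
```

The shrinking (minimization) that proptest performs when a property fails is what turns a random counterexample into a small, debuggable operation sequence.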
Deterministic simulation
Another lightweight method widely used at AWS is deterministic simulation testing, in which a distributed system is executed on a single-threaded simulator with control over all sources of randomness, such as thread scheduling, timing, and message delivery order. Tests are then written for particular failure or success scenarios, such as the failure of a participant at a particular stage in a distributed protocol. The nondeterminism in the system is controlled by the test framework, allowing developers to specify orderings they believe are interesting (such as ones that have caused bugs in the past). The scheduler in the testing framework can also be extended to fuzz orderings or to exhaustively explore all possible orderings.
Deterministic simulation testing moves testing of system properties, such as behavior under delay and failure, from integration time to build time. This accelerates development and provides more complete behavioral coverage during testing. Some of the work done at AWS on build-time testing of thread ordering and systems failures has been open-sourced as part of the shuttle (https://github.com/awslabs/shuttle) and turmoil (https://github.com/tokio-rs/turmoil) projects.
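As a minimal sketch of the build-time style of test this enables, the example below uses shuttle's check_random entry point to run a small two-thread counter under a randomized but fully controlled scheduler. The scenario is invented for illustration, and the exact API usage should be treated as an assumption based on shuttle's public documentation rather than any AWS-internal harness:

```rust
use shuttle::sync::{Arc, Mutex};
use shuttle::thread;

#[test]
fn concurrent_increments_are_not_lost() {
    // check_random explores many schedules; a failing schedule can be replayed
    // deterministically because the scheduler owns all nondeterminism.
    shuttle::check_random(
        || {
            let counter = Arc::new(Mutex::new(0u64));
            let handles: Vec<_> = (0..2)
                .map(|_| {
                    let counter = Arc::clone(&counter);
                    thread::spawn(move || {
                        // A read-modify-write made safe by the mutex.
                        *counter.lock().unwrap() += 1;
                    })
                })
                .collect();
            for handle in handles {
                handle.join().unwrap();
            }
            // Property: no increment is lost under any explored interleaving.
            assert_eq!(*counter.lock().unwrap(), 2);
        },
        100,
    );
}
```

Because the scheduler controls every interleaving, a failure found this way can be replayed exactly, which is what makes such tests practical to debug long before integration testing.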
Continuous fuzzing or random test-input generation
Continuous fuzzing, especially coverage-guided scalable test-input generation, is also effective for testing systems correctness at integration time. During the development of Amazon Aurora's data-sharding feature (Aurora Limitless Database3), for example, we made extensive use of fuzzing to test two key properties of the system. First, by fuzzing SQL queries (and entire transactions), we validated that the logic partitioning SQL execution over shards is correct. Large volumes of random SQL schemas, datasets, and queries are synthesized and run through the engines under test, and the results are compared with an oracle based on the nonsharded version of the engine (as well as with other approaches to validation, such as those pioneered by SQLancer23).
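The shape of such a differential test can be sketched in Rust with proptest, with a toy grouped-sum workload standing in for SQL execution over shards. The functions, data shapes, and shard counts below are assumptions made for this example, not anything taken from Aurora:

```rust
use proptest::prelude::*;
use std::collections::HashMap;

// Oracle: compute grouped sums over the whole dataset in one place.
fn grouped_sum_oracle(rows: &[(u8, i64)]) -> HashMap<u8, i64> {
    let mut out = HashMap::new();
    for (k, v) in rows {
        *out.entry(*k).or_insert(0) += *v;
    }
    out
}

// System under test: hash-partition rows across shards, aggregate per shard,
// then merge the partial results (the step that is easy to get wrong at scale).
fn grouped_sum_sharded(rows: &[(u8, i64)], shards: usize) -> HashMap<u8, i64> {
    let mut partials: Vec<HashMap<u8, i64>> = vec![HashMap::new(); shards];
    for (k, v) in rows {
        let shard = (*k as usize) % shards;
        *partials[shard].entry(*k).or_insert(0) += *v;
    }
    let mut merged = HashMap::new();
    for partial in partials {
        for (k, v) in partial {
            *merged.entry(k).or_insert(0) += v;
        }
    }
    merged
}

proptest! {
    #[test]
    fn sharded_execution_matches_oracle(
        rows in proptest::collection::vec((any::<u8>(), -1000i64..1000), 0..200),
        shards in 1usize..8
    ) {
        // Differential check: partitioned execution must agree with the oracle.
        prop_assert_eq!(grouped_sum_sharded(&rows, shards), grouped_sum_oracle(&rows));
    }
}
```

The essential design choice is the oracle: because the nonsharded implementation already defines the correct answer, any divergence points directly at the partitioning or merging logic.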
Comments
OhMeadhbh
I find this highly unlikely. My first day at Amazon I encountered an engineer puzzling over a perfect sine wave in a graph. After looking at the scale I made the comment "oh. you're using 5 minute metrics." Their response was "how could you figure that out just by looking at the graph?" When I replied "Queuing theory and control theory," their response was "what's that?"
Since then, Amazon's hiring practices have only gotten worse. Can you invert a tree? Can you respond "tree" or "hash map" when you're asked what is the best data structure for a specific situation? Can you solve a riddle or code an ill-explained l33tcode problem? Are you sure you can parse HTML with regexes? You're Amazon material.
Did you pay attention to the lecture about formal proofs? TLA+ or Coq/Kami? That's great, but it won't help you get a job at Amazon.
The idea that formal proofs are used anywhere but the most obscure corners of AWS is laughable.
Although… it is a nice paper. Props to Amazon for supporting Ph.D.'s doing pure research that will never impact AWS' systems or processes.
nullorempty
And what teams use these methods exactly?
pera
> we also sought a language that would allow us to model check (and later prove) key aspects of systems designs while being more approachable to programmers.
I find it a bit surprising that TLA+ with PlusCal can be challenging to learn for your average software engineer. Could anyone with experience in P show an example of something that is difficult to express in TLA+ but significantly easier in P?
jlcases
I've noticed that the formalization of methods described by AWS parallels what we need in technical documentation. Complex systems require not just formal verification but also structured documentation following MECE principles (Mutually Exclusive, Collectively Exhaustive).
In my experience, the interfaces between components (where most errors occur) are exactly where fragmented documentation fails. I implemented a hierarchical documentation system for my team that organizes knowledge as a conceptual tree, and the accuracy of code generation with AI assistants improved notably.
Formal verification tools and structured documentation are complementary: verification ensures algorithmic correctness while MECE documentation guarantees conceptual and contextual correctness. I wonder if AWS has experimented with structured documentation systems specifically for AI, or if this remains an area to explore.
csbartus
I've recently created a likely-correct piece of software based on these principles.
https://www.osequi.com/studies/list/list.html
The structure (ontology, taxonomy) is created with ologs, a formal method from category theory. The behavior (choreography) is created with a semi-formal implementation (XState) of a formal method (Finite State Machines).
The user-facing aspect of the software is designed with Concept Design, a semi-formal method from MIT CSAIL.
Creating software with these methods is refreshing and fun. Maybe one day we can reach Tonsky's "Diagrams are code" vision.
https://tonsky.me/blog/diagrams/
gqgs
A key concern I've consistently had regarding formal verification systems is: how does one confirm the accuracy of the verifier itself?
This issue appears to present an intrinsically unsolvable problem, implying that a formally verified system could still contain bugs due to potential issues in the verification software.
While this perspective doesn't necessarily render formal verification impractical, it does introduce certain caveats that, in my experience, are not frequently addressed in discussions about these systems.