HoarePrompt structurally reasons about program correctness in natural language using large language models (LLMs). Drawing inspiration from the strongest postcondition calculus, it employs a step-by-step process to generate natural language descriptions of reachable program states at various program points. By aligning formal semantics with informal requirements, HoarePrompt enables LLMs to detect bugs that violate natural language specifications.
Key features:
- Correctness assessment: classify program correctness with respect to natural language specifications, providing a verdict of `CORRECT` or `INCORRECT`.
- Descriptions of reachable states: infer natural language descriptions of reachable program states without executing the program.
- Loop summarization: apply a novel few-shot-driven k-induction to precisely summarize the semantics of loops in natural language.
Preprint:
- HoarePrompt: Structural Reasoning About Program Correctness in Natural Language
Dimitrios Stamatios Bouras, Yihan Dai, Tairan Wang, Yingfei Xiong, Sergey Mechtaev
https://arxiv.org/abs/2503.19599
To ensure dependency isolation and compatibility, it’s recommended to create a virtual environment:
```bash
# Create a virtual environment
python3 -m venv hoareprompt-env

# Activate the virtual environment
source hoareprompt-env/bin/activate
```
Install the necessary dependencies by running:
```bash
pip install -r requirements.txt
```
Depending on the LLM service you’re using, set environment variables for the API keys.
For OpenAI models:
```bash
export OPENAI_API_KEY="your-openai-api-key"
```
For Qwen models:
```bash
export DASHSCOPE_API_KEY="your-dashscope-api-key"
```
You can add these export commands to your `.bashrc` or `.zshrc` file to avoid having to set them every time.
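For example, assuming a bash shell (use `~/.zshrc` instead for zsh):

```bash
# Persist the API key so it is available in every new shell session
echo 'export OPENAI_API_KEY="your-openai-api-key"' >> ~/.bashrc
```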
Additional providers and models can be configured in `src/model.py`.
To motivate our approach, we first show that annotating a program with natural language descriptions of reachable program states enhances an LLM’s reasoning about program correctness.
Consider the problem: Given an integer list, the task is to find the maximum product of any sublist. The left side of the figure below presents a flawed implementation of Kadane’s algorithm for solving this problem.
The description and code for this example can be found at `examples/motivating_example/description.txt` and `examples/motivating_example/program.py`.
The bug in this implementation stems from incorrect indentation, which results in `best_so_far` being updated only once, after the loop ends, instead of on each iteration.
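The sketch below illustrates this kind of bug; the actual buggy program is the one shipped in `examples/motivating_example/program.py`, so the function name and helper variables here are only illustrative:

```python
# Illustrative sketch of the indentation bug described above (not necessarily
# the exact code in examples/motivating_example/program.py).
def max_product_sublist(nums):
    best_so_far = nums[0]
    max_ending_here = nums[0]
    min_ending_here = nums[0]
    for x in nums[1:]:
        candidates = (x, max_ending_here * x, min_ending_here * x)
        max_ending_here = max(candidates)
        min_ending_here = min(candidates)
    # BUG: mis-indented, so best_so_far is updated only once, after the loop
    # ends, instead of on every iteration.
    best_so_far = max(best_so_far, max_ending_here)
    return best_so_far

# For example, max_product_sublist([2, 3, -2, 4]) returns 4 instead of 6.
```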
Despite the simplicity of this bug, state-of-the-art LLMs often fail to identify it consistently, misclassifying the program as correct. Because LLMs exhibit non-deterministic behavior and are sensitive to prompt design, testing their effectiveness in detecting such errors is challenging. To provide objective evidence, we crafted six distinct prompts—including a simple “Vanilla” prompt, a Zero-shot-CoT prompt, and an open-ended prompt—each containing the problem description and the code snippet. We generated 10 responses for each prompt using three different LLMs and various temperature settings.
| Model | Not Annotated | Annotated |
|---|---|---|
| Qwen2.5-7B | 0% | 11.7% |
| Qwen2.5-Coder-32B | 15% | 85% |
| Qwen2.5-72B | 35% | 85% |
The results above show that without state annotations, LLMs rarely detect the bug. An example of an incorrect response from the Vanilla prompt is presented on the left side of the figure. However, HoarePrompt enhances this process by systematically inferring natural language preconditions and postconditions from the problem description, then propagating reachable program states across the control flow.
By iteratively applying this process, HoarePrompt constructs a structured representation of program states, making it easier for LLMs to reason about correctness. With these annotations, the bug detection rate increases significantly.
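As an illustration of what such annotations look like (the wording below is ours, not HoarePrompt's exact output), here is the buggy function from above with natural language state descriptions inserted as comments:

```python
def max_product_sublist(nums):
    # State: nums is a non-empty list of integers.
    best_so_far = nums[0]
    max_ending_here = nums[0]
    min_ending_here = nums[0]
    # State: best_so_far, max_ending_here and min_ending_here all equal nums[0].
    for x in nums[1:]:
        candidates = (x, max_ending_here * x, min_ending_here * x)
        max_ending_here = max(candidates)
        min_ending_here = min(candidates)
        # State: max_ending_here / min_ending_here hold the largest / smallest
        # products of sublists ending at x; best_so_far is still nums[0],
        # because it is never updated inside the loop.
    # State: best_so_far combines nums[0] only with products ending at the last
    # element, so the overall maximum can be missed. The bug is now apparent.
    best_so_far = max(best_so_far, max_ending_here)
    return best_so_far
```

Annotated like this, the mismatch between the intended behaviour (track the best product seen so far) and the actual reachable states is much easier for an LLM to spot.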
You can run the motivating example using the following command:

```bash
python src/hoareprompt.py --description examples/motivating_example/description.txt --program examples/motivating_example/program.py --config configs/config_motivational.json
```
Alternatively, specify the command type and a log path explicitly:

```bash
python src/hoareprompt.py --command assess --description examples/motivating_example/description.txt --program examples/motivating_example/program.py --config configs/config_motivational.json --log logs/motivating_example/
```
HoarePrompt provides several commands to analyze programs.
The `assess` command evaluates whether a program conforms to a given problem description. If no command is specified, HoarePrompt defaults to `assess`.
Usage:
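A representative invocation, mirroring the motivating example above (substitute your own description, program, and configuration files):

```bash
python src/hoareprompt.py --command assess --description <description-file> --program <program-file> --config <config-file>
```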
The output is `CORRECT` or `INCORRECT`, based on the program assessment.
Generates a precondition from the problem description.