HoarePrompt structurally reasons about program correctness in natural language using large language models (LLMs). Drawing inspiration from the strongest postcondition calculus, it employs a step-by-step process to generate natural language descriptions of reachable program states at various program points. By aligning formal semantics with informal requirements, HoarePrompt enables LLMs to detect bugs that violate natural language specifications.
Key features:
- Correctness assessment: classify program correctness with respect to natural language specifications, providing a verdict of `CORRECT` or `INCORRECT`.
- Descriptions of reachable states: infer natural language descriptions of reachable program states without executing the program.
- Loop summarization: apply a novel few-shot-driven k-induction to precisely summarize the semantics of loops in natural language.
Preprint:
- HoarePrompt: Structural Reasoning About Program Correctness in Natural Language
Dimitrios Stamatios Bouras, Yihan Dai, Tairan Wang, Yingfei Xiong, Sergey Mechtaev
https://arxiv.org/abs/2503.19599
To ensure dependency isolation and compatibility, it’s recommended to create a virtual environment:
```bash
# Create a virtual environment
python3 -m venv hoareprompt-env

# Activate the virtual environment
source hoareprompt-env/bin/activate
```
Install the necessary dependencies by running:
```bash
pip install -r requirements.txt
```
Depending on the LLM service you’re using, set environment variables for the API keys.
For OpenAI models:
```bash
export OPENAI_API_KEY="your-openai-api-key"
```
For Qwen models:
```bash
export DASHSCOPE_API_KEY="your-dashscope-api-key"
```
You can add these export commands to your `.bashrc` or `.zshrc` file to avoid having to set them every time.
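For example, assuming a bash shell (use `~/.zshrc` instead for zsh):

```bash
# Persist the API key so it is available in every new shell session
echo 'export OPENAI_API_KEY="your-openai-api-key"' >> ~/.bashrc
```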
Additional providers and models can be configured in `src/model.py`.
To motivate our approach, we first show that annotating a program with natural language descriptions of reachable program states enhances an LLM’s reasoning about program correctness.
Consider the problem: Given an integer list, the task is to find the maximum product of any sublist. The left side of the figure below presents a flawed implementation of Kadane’s algorithm for solving this problem.
The description and code for this example can be found at `examples/motivating_example/description.txt` and `examples/motivating_example/program.py`.
The bug in this implementation stems from incorrect indentation, which results in `best_so_far` being updated only once, after the loop ends, instead of on each iteration.
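The sketch below illustrates this kind of bug; the actual buggy program is the one shipped in `examples/motivating_example/program.py`, so the function name and helper variables here are only illustrative:

```python
# Illustrative sketch of the indentation bug described above (not necessarily
# the exact code in examples/motivating_example/program.py).
def max_product_sublist(nums):
    best_so_far = nums[0]
    max_ending_here = nums[0]
    min_ending_here = nums[0]
    for x in nums[1:]:
        candidates = (x, max_ending_here * x, min_ending_here * x)
        max_ending_here = max(candidates)
        min_ending_here = min(candidates)
    # BUG: mis-indented, so best_so_far is updated only once, after the loop
    # ends, instead of on every iteration.
    best_so_far = max(best_so_far, max_ending_here)
    return best_so_far

# For example, max_product_sublist([2, 3, -2, 4]) returns 4 instead of 6.
```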
Despite the simplicity of this bug, state-of-the-art LLMs often fail to identify it consistently, misclassifying the program as correct. Because LLMs exhibit non-deterministic behavior and are sensitive to prompt design, testing their effectiveness in detecting such errors is challenging. To provide objective evidence, we crafted six distinct prompts—including a simple “Vanilla” prompt, a Zero-shot-CoT prompt, and an open-ended prompt—each containing the problem description and the code snippet. We generated 10 responses for each prompt using three different LLMs and various temperature settings.
| Model | Not Annotated | Annotated |
|---|---|---|
| Qwen2.5-7B | 0% | 11.7% |
| Qwen2.5-Coder-32B | 15% | 85% |
| Qwen2.5-72B | 35% | 85% |
The results above show that without state annotations, LLMs rarely detect the bug. An example of an incorrect response from the Vanilla prompt is presented on the left side of the figure. However, HoarePrompt enhances this process by systematically inferring natural language preconditions and postconditions from the problem description, then propagating reachable program states across the control flow.
By iteratively applying this process, HoarePrompt constructs a structured representation of program states, making it easier for LLMs to reason about correctness. With these annotations, the bug detection rate increases significantly.
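As an illustration of what such annotations look like (the wording below is ours, not HoarePrompt's exact output), here is the buggy function from above with natural language state descriptions inserted as comments:

```python
def max_product_sublist(nums):
    # State: nums is a non-empty list of integers.
    best_so_far = nums[0]
    max_ending_here = nums[0]
    min_ending_here = nums[0]
    # State: best_so_far, max_ending_here and min_ending_here all equal nums[0].
    for x in nums[1:]:
        candidates = (x, max_ending_here * x, min_ending_here * x)
        max_ending_here = max(candidates)
        min_ending_here = min(candidates)
        # State: max_ending_here / min_ending_here hold the largest / smallest
        # products of sublists ending at x; best_so_far is still nums[0],
        # because it is never updated inside the loop.
    # State: best_so_far combines nums[0] only with products ending at the last
    # element, so the overall maximum can be missed. The bug is now apparent.
    best_so_far = max(best_so_far, max_ending_here)
    return best_so_far
```

Annotated like this, the mismatch between the intended behaviour (track the best product seen so far) and the actual reachable states is much easier for an LLM to spot.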
You can run the motivating example using the following command:

```bash
python src/hoareprompt.py --description examples/motivating_example/description.txt --program examples/motivating_example/program.py --config configs/config_motivational.json
```
Alternatively, specify the command type and a log path explicitly:

```bash
python src/hoareprompt.py --command assess --description examples/motivating_example/description.txt --program examples/motivating_example/program.py --config configs/config_motivational.json --log logs/motivating_example/
```
HoarePrompt provides several commands to analyze programs.
The `assess` command evaluates whether a program conforms to a given problem description. If no command is specified, HoarePrompt defaults to `assess`.
Usage:
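A representative invocation, mirroring the motivating example above (substitute your own description, program, and configuration files):

```bash
python src/hoareprompt.py --command assess --description <description-file> --program <program-file> --config <config-file>
```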
The output is `CORRECT` or `INCORRECT`, based on the program assessment.
Generates a precondition from the problem description.