What should we believe about the reasoning abilities of today’s large language models? As the headlines above illustrate, there’s a debate raging over whether these enormous pre-trained neural networks have achieved humanlike reasoning abilities, or whether their skills are in fact “a mirage.”
Reasoning is a central aspect of human intelligence, and robust domain-independent reasoning abilities have long been a key goal for AI systems. While large language models (LLMs) are not explicitly trained to reason, they have exhibited “emergent” behaviors that sometimes look like reasoning. But are these behaviors actually driven by true abstract reasoning abilities, or by some other less robust and generalizable mechanism—for example, by memorizing their training data and later matching patterns in a given problem to those found in training data?
Why does this matter? If robust general-purpose reasoning abilities have emerged in LLMs, this bolsters the claim that such systems are an important step on the way to trustworthy general intelligence. On the other hand, if LLMs rely primarily on memorization and pattern-matching rather than true reasoning, then they will not be generalizable—we can’t trust them to perform well on “out of distribution” tasks, those that are not sufficiently similar to tasks they’ve seen in the training data.
What Is “Reasoning”?
The word “reasoning” is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other “rational” or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference. Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can solve not only 23+37 but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.
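To make the abstraction point concrete, here is a minimal sketch in Python (my own illustration, not drawn from any of the work discussed here): the carrying procedure for addition never mentions the number 10, so the same few lines add numbers in base 10, base 8, or any other base.

```python
# Illustration only: the carry rule below is base-agnostic, so the same
# procedure adds numbers in any base -- only the "base" parameter changes.

def add_in_base(a: str, b: str, base: int = 10) -> str:
    """Add two non-negative integers written as digit strings in the given base."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    value = {d: i for i, d in enumerate(digits[:base])}
    a, b = a[::-1], b[::-1]                    # work from the least significant digit
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        column = carry
        column += value[a[i]] if i < len(a) else 0
        column += value[b[i]] if i < len(b) else 0
        carry, digit = divmod(column, base)    # the carry rule never mentions "10"
        result.append(digits[digit])
    if carry:
        result.append(digits[carry])
    return "".join(reversed(result))

print(add_in_base("23", "37"))           # base 10 -> '60'
print(add_in_base("23", "37", base=8))   # base 8  -> '62' (19 + 31 = 50 in decimal)
```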
“Chain of Thought” Reasoning in LLMs
In the last few years, there has been a deluge of papers making claims for reasoning abilities in LLMs (Huang & Chang give one recent survey). One of the most influential such papers (by Wei et al. from Google Research) proposed that so-called “Chain of Thought” (CoT) prompting elicits sophisticated reasoning abilities in these models. A CoT prompt gives one or more examples of a problem and the reasoning steps needed to solve it, and then poses a new problem. The LLM then outputs text that follows a similar set of reasoning steps before giving an answer. This figure from the paper by Wei et al. gives an example:
Wei et al. tested the CoT prompting method on datasets with math word problems, commonsense reasoning problems, and other domains. In many cases, CoT prompting dramatically improved the performance of LLMs on these datasets.
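To make the format concrete, here is a minimal sketch of what such a prompt looks like as plain text. The worked example and new question follow the running math-word-problem example from the Wei et al. paper, reproduced here from memory (so the exact wording may differ); the essential ingredients are a worked example that spells out its reasoning steps, followed by a new problem posed in the same format.

```python
# Sketch of a chain-of-thought (CoT) prompt: one worked example with explicit
# reasoning steps, then a new question the model is expected to answer in the
# same style. The prompt is ordinary text, sent to the LLM as-is.

cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""

print(cot_prompt)   # an LLM would continue from "A:" with its own reasoning steps
```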
Two downsides of CoT prompting are (1) it takes some effort on the part of the human prompter to construct such prompts; and (2) often, being able to construct a CoT example to include in the prompt requires the prompter to already know how to solve the problem being given to the LLM!
A subsequent paper proposed that these problems can be avoided by leaving out the CoT example, and just posing the problem to an LLM with the added phrase “Let’s think step by step.” Here’s an example from that paper:
Amazingly, this simple addition to a prompt (termed “zero-shot CoT prompting”) substantially improves LLMs’ performance on several reasoning benchmarks, as compared with prompts that omit “Let’s think step by step.”
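For comparison, here is a minimal sketch of the zero-shot variant: no worked example at all, just the trigger phrase appended after the question. The word problem and the `send_to_llm` call below are placeholders of my own, not taken from the paper.

```python
# Sketch of zero-shot CoT prompting: no worked example is given; the phrase
# "Let's think step by step." is simply appended after the question.
# send_to_llm() is a hypothetical stand-in for whatever model API is being used.

question = ("A store had 14 oranges. It sold half of them in the morning "
            "and bought 5 more in the afternoon. How many oranges does it have now?")

zero_shot_cot_prompt = f"Q: {question}\nA: Let's think step by step."

# answer = send_to_llm(zero_shot_cot_prompt)   # hypothetical LLM call
print(zero_shot_cot_prompt)
```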
Why does CoT prompting—whether with an example or with “Let’s think step by step”—have this dramatic effect on LLMs? Interestingly, while there are many hypotheses, this question has not been definitively answered.
Are the Reasoning Steps Generated Under CoT Prompting Faithful to the Actual Reasoning Process?
While the above examples of CoT and zero-shot CoT prompting show the language model generating text that looks like correct step-by-step reasoning about the given problem, one can ask whether the text the model generates is “faithful”: does it describe the actual reasoning process the LLM uses to solve the problem? LLMs are not trained to generate text that accurately reflects their own internal “reasoning” processes; they are trained only to generate plausible-sounding text in response to a prompt. What, then, is the connection between the generated text and the LLM’s actual process of arriving at an answer?
This question has been addressed by