On November 28th, OpenAI released a new addition to the GPT-3 model family: davinci-003. This latest model builds on InstructGPT, using reinforcement learning from human feedback to better align language models with human instructions. Unlike davinci-002, which uses supervised fine-tuning on human-written demonstrations and highly scored model samples to improve generation quality, davinci-003 is a true reinforcement learning from human feedback (RLHF) model. It is trained with proximal policy optimization (PPO) to maximize the score that a separate “reward model” assigns to its generated text; that reward model is in turn trained on human graders’ comparative ratings of different model outputs. More details can be found in OpenAI’s model index for researchers. The net result is that davinci-003 is tuned to produce outputs that it predicts humans would score highly. Scale is proud to partner with OpenAI to provide this human feedback.
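For readers who want a concrete picture of that comparison-based training signal, here is a minimal toy sketch of a pairwise reward-model loss in PyTorch. Everything in it (the tiny linear model, the random stand-in “embeddings”, the batch size) is invented for illustration of the general technique and is not OpenAI’s implementation.

```python
# Toy sketch of a pairwise reward-model objective (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Maps a (stand-in) text embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

reward_model = ToyRewardModel()

# Stand-ins for embeddings of two candidate outputs for the same prompt,
# where human graders preferred the first output over the second.
preferred = torch.randn(4, 16)
rejected = torch.randn(4, 16)

# Train the reward model so preferred outputs score higher than rejected
# ones: loss = -log(sigmoid(r_preferred - r_rejected)).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
loss.backward()
print(float(loss))
```

PPO then fine-tunes the language model itself to generate text that this learned reward model scores highly, typically with a penalty that keeps the policy from drifting too far from the original model.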
OpenAI’s announcement email mentions the following improvements for davinci-003:
- “It produces higher quality writing. This will help your applications deliver clearer, more engaging, and more compelling content.
- It can handle more complex instructions, meaning you can get even more creative with how you make use of its capabilities now.
- It’s better at longer form content generation, allowing you to take on tasks that would have previously been too difficult to achieve.”
But how much better is davinci-003, really? We decided to put it to the test quantitatively. Using Scale Spellbook, the platform to build, compare, and deploy large language model apps, we evaluated davinci-003 against davinci-002 on tasks ranging from few- and zero-shot classification to summarization and poetry writing. Here’s what we found.
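To make the setup concrete, the sketch below shows a bare-bones head-to-head comparison using the OpenAI Python library directly rather than Spellbook; the prompt, sampling parameters, and the assumed OPENAI_API_KEY environment variable are illustrative placeholders.

```python
# Minimal sketch: send the same prompt to both models and compare outputs.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = "Write a short, engaging product description for a reusable water bottle."

for model in ("text-davinci-002", "text-davinci-003"):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
    )
    print(f"--- {model} ---")
    print(response["choices"][0]["text"].strip())
```

Spellbook’s built-in evaluation feature runs this kind of side-by-side comparison across whole datasets, like the 250 Yelp reviews below, rather than one prompt at a time.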
Classification
It’s well known that davinci-002 performs very well on classification with a few-shot prompt, and davinci-003 appears to offer comparable (though slightly worse) few-shot performance. Using Spellbook’s built-in evaluation feature on 250 Yelp reviews, we compare how the two models classify these reviews as Positive, Negative, or Neutral given a four-shot prompt, meaning four labeled examples are included in the prompt.
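For concreteness, a hypothetical four-shot sentiment prompt in this spirit might look like the sketch below; the example reviews and exact wording are invented for illustration and are not the prompt used in the evaluation.

```python
# Hypothetical four-shot sentiment-classification prompt (illustration only).
FEW_SHOT_TEMPLATE = """Classify each Yelp review as Positive, Negative, or Neutral.

Review: The pasta was incredible and the staff couldn't have been nicer.
Label: Positive

Review: We waited an hour for cold food and nobody apologized.
Label: Negative

Review: It was fine. Nothing special, but nothing went wrong either.
Label: Neutral

Review: Decent burgers, standard prices, average service overall.
Label: Neutral

Review: {review}
Label:"""

def build_prompt(review: str) -> str:
    """Insert a held-out review into the four-shot template."""
    return FEW_SHOT_TEMPLATE.format(review=review)

if __name__ == "__main__":
    print(build_prompt("The tacos were amazing and the salsa bar is huge."))
```

In the evaluation, each held-out review is substituted into the template, a completion is requested from each model, and the returned label is compared against the gold label.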