On November 28th, OpenAI released a new addition to the GPT-3 model family: davinci-003. This latest model builds on InstructGPT, using reinforcement learning from human feedback to better align language models with human instructions. Unlike davinci-002, which uses supervised fine-tuning on human-written demonstrations and highly scored model samples to improve generation quality, davinci-003 is a true reinforcement learning from human feedback (RLHF) model. It is trained with proximal policy optimization (PPO) to maximize the score that a separate “reward model” assigns to its generated text; that reward model is in turn trained on human graders’ comparative ratings of different model outputs. More details can be found in OpenAI’s model index for researchers. The net result is that davinci-003 is tuned to produce outputs that it predicts humans would score highly. Scale is proud to partner with OpenAI to provide this human feedback.
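For readers who want a concrete picture of that comparison-based training signal, here is a minimal toy sketch of a pairwise reward-model loss in PyTorch. Everything in it (the tiny linear model, the random stand-in “embeddings”, the batch size) is invented for illustration of the general technique and is not OpenAI’s implementation.

```python
# Toy sketch of a pairwise reward-model objective (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Maps a (stand-in) text embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score(embedding).squeeze(-1)

reward_model = ToyRewardModel()

# Stand-ins for embeddings of two candidate outputs for the same prompt,
# where human graders preferred the first output over the second.
preferred = torch.randn(4, 16)
rejected = torch.randn(4, 16)

# Train the reward model so preferred outputs score higher than rejected
# ones: loss = -log(sigmoid(r_preferred - r_rejected)).
loss = -F.logsigmoid(reward_model(preferred) - reward_model(rejected)).mean()
loss.backward()
print(float(loss))
```

PPO then fine-tunes the language model itself to generate text that this learned reward model scores highly, typically with a penalty that keeps the policy from drifting too far from the original model.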
OpenAI’s announcement email mentions the following improvements for davinci-003:
- “It produces higher quality writing. This will help your applications deliver clearer, more engaging, and more compelling content.
- It can handle more complex instructions, meaning you can get even more creative with how you make use of its capabilities now.
- It’s better at longer form content generation, allowing you to take on tasks that would have previously been too difficult to achieve.”
But how much better is davinci-003, really? We decided to put it to the test quantitatively. Using Scale Spellbook, the platform to build, compare, and deploy large language model apps, we evaluated davinci-003 against davinci-002 on tasks ranging from few- and zero-shot classification to summarization and poetry writing. Here’s what we found.
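To make the setup concrete, the sketch below shows a bare-bones head-to-head comparison using the OpenAI Python library directly rather than Spellbook; the prompt, sampling parameters, and the assumed OPENAI_API_KEY environment variable are illustrative placeholders.

```python
# Minimal sketch: send the same prompt to both models and compare outputs.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompt = "Write a short, engaging product description for a reusable water bottle."

for model in ("text-davinci-002", "text-davinci-003"):
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
    )
    print(f"--- {model} ---")
    print(response["choices"][0]["text"].strip())
```

Spellbook’s built-in evaluation feature runs this kind of side-by-side comparison across whole datasets, like the 250 Yelp reviews below, rather than one prompt at a time.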
Classification
It’s well known that davinci-002 performs very well on classification with a few-shot prompt, and davinci-003 appears to offer comparable (though slightly worse) few-shot performance. Using Spellbook’s built-in evaluation feature on 250 Yelp reviews, we compare how the two models classify these reviews as Positive, Negative, or Neutral given a four-shot prompt, meaning four labeled examples are included in the prompt.
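For concreteness, a hypothetical four-shot sentiment prompt in this spirit might look like the sketch below; the example reviews and exact wording are invented for illustration and are not the prompt used in the evaluation.

```python
# Hypothetical four-shot sentiment-classification prompt (illustration only).
FEW_SHOT_TEMPLATE = """Classify each Yelp review as Positive, Negative, or Neutral.

Review: The pasta was incredible and the staff couldn't have been nicer.
Label: Positive

Review: We waited an hour for cold food and nobody apologized.
Label: Negative

Review: It was fine. Nothing special, but nothing went wrong either.
Label: Neutral

Review: Decent burgers, standard prices, average service overall.
Label: Neutral

Review: {review}
Label:"""

def build_prompt(review: str) -> str:
    """Insert a held-out review into the four-shot template."""
    return FEW_SHOT_TEMPLATE.format(review=review)

if __name__ == "__main__":
    print(build_prompt("The tacos were amazing and the salsa bar is huge."))
```

In the evaluation, each held-out review is substituted into the template, a completion is requested from each model, and the returned label is compared against the gold label.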