A team of researchers from Carnegie Mellon, Yale, and UC Berkeley investigating Machiavellian tendencies in chatbots made a surprising side discovery: OpenAI’s GPT-4 outperformed the most skilled crowdworkers they had hired to label their dataset. The finding saved the researchers an estimated $500,000 and 20,000 hours of human labor.
Innovative Approach Driven by Cost Concerns
The researchers needed to annotate 572,322 text scenarios and wanted a cost-effective way to do it. Employing Surge AI’s top-tier human annotators at $25 per hour would have cost $500,000 for 20,000 hours of work, more than the research project could justify. Surge AI is a venture-backed startup that performs human labeling for numerous AI companies, including OpenAI, Meta, and Anthropic.
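The cost estimate can be checked with a few lines of arithmetic. The sketch below uses only the three figures quoted above; the per-scenario cost and time are averages implied by those figures, not numbers reported by the researchers.

```python
# Figures quoted in the article.
HOURLY_RATE_USD = 25
ESTIMATED_HOURS = 20_000
NUM_SCENARIOS = 572_322

# Total cost of human annotation at the quoted rate.
total_cost_usd = HOURLY_RATE_USD * ESTIMATED_HOURS

# Implied averages (derived here, not stated in the source).
cost_per_scenario_usd = total_cost_usd / NUM_SCENARIOS
seconds_per_scenario = ESTIMATED_HOURS * 3600 / NUM_SCENARIOS

print(f"Total cost: ${total_cost_usd:,}")
print(f"Cost per scenario: ${cost_per_scenario_usd:.2f}")
print(f"Time per scenario: {seconds_per_scenario:.0f} seconds")
```

On these numbers, each scenario would take a human annotator roughly two minutes and cost a little under a dollar.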
The team tested whether GPT-4 could automate the labeling with custom prompts. “Model labels are competitive with human labels,” the researchers reported.
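To give a sense of what prompt-based labeling can look like in practice, here is a minimal sketch. It is not the researchers’ prompt or code: the label set, the prompt wording, and the `ask_model` callable (a stand-in for whatever function sends a prompt to a language model and returns its text reply) are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical label set; the study's actual labels are not given here.
LABELS = ("yes", "no")


def build_prompt(scenario: str, question: str) -> str:
    """Assemble a labeling prompt for one scenario (illustrative wording)."""
    return (
        "You are annotating text scenarios.\n"
        f"Scenario: {scenario}\n"
        f"Question: {question}\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )


def label_scenario(
    scenario: str,
    question: str,
    ask_model: Callable[[str], str],
) -> Optional[str]:
    """Return the model's label, or None if the reply is not a valid label."""
    reply = ask_model(build_prompt(scenario, question)).strip().lower()
    return reply if reply in LABELS else None
```

Passing the model call in as a parameter keeps the sketch independent of any particular API, and returning `None` for unexpected replies lets those scenarios be set aside for retry or human review instead of being silently mislabeled.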
In a comparison of 2,000 labeled data points by three expe