Yesterday, OpenAI released Deep Research, a system that browses the web to summarize content and answer questions based on the summary. The system is impressive and blew our minds when we tried it for the first time.
One of the main results in the blog post is a strong improvement in performance on the General AI Assistants benchmark (GAIA), a benchmark we’ve been playing with recently as well, where they reached nearly 67% correct answers on average in 1-shot, and 47.6% on especially challenging “level 3” questions that involve multiple steps of reasoning and tool usage (see below for a presentation of GAIA).
DeepResearch is composed of an LLM (which can be selected from the current list of LLMs provided by OpenAI: 4o, o1, o3, etc.) and an internal “agentic framework” which guides the LLM to use tools like web search and to organize its actions in steps.
While powerful LLMs are now freely available in open-source (see e.g. the recent DeepSeek R1 model), OpenAI didn’t disclose much about the agentic framework underlying Deep Research…
So we decided to embark on a 24-hour mission to reproduce their results and open-source the needed framework along the way!
The clock is ticking, let’s go! ⏱️
Table of Contents
- What are Agent frameworks and why do they matter?
- The GAIA benchmark
- Building an open Deep Research
- Results 🏅
- Community reproductions
- Most important next steps
What are Agent frameworks and why do they matter?
An Agent framework is a layer on top of an LLM to make said LLM execute actions (like browse the web or read PDF documents), and organize its operations in a series of steps.
For a quick intro to agents, check this great interview by Andrew Ng and our introduction blog post to the smolagents library. For a more detailed dive into agents, you can subscribe to our agents course that starts in just a few days: link here.
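To make this concrete, here is a minimal sketch of an agentic setup built with smolagents. The class names used below (CodeAgent, HfApiModel, DuckDuckGoSearchTool) are assumptions based on the library at the time of writing; check the smolagents docs for the current API.

```python
# Minimal sketch of an agentic setup with smolagents (illustrative, not the
# DeepResearch implementation). Class names assume the smolagents API at the
# time of writing.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

model = HfApiModel()  # defaults to a hosted open LLM on the HF Inference API

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],  # gives the LLM a web-search action
    model=model,
)

# The agent plans, searches the web, and iterates over several steps
# before returning a final answer.
print(agent.run(
    "How many seconds would it take for a leopard at full speed "
    "to run through Pont des Arts?"
))
```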
Almost everyone has already experienced how powerful LLMs can be simply by playing with chatbots. However, what not everyone is aware of yet is that integrating these LLMs into agentic systems can give them real superpowers!
Here is a recent example comparing the performance of a few frontier LLMs with and without an agentic framework (in this case the simple smolagents library) – using an agentic framework bumps performance by up to 60 points!
In fact, OpenAI also highlighted in its release blogpost how Deep Research performed dramatically better than standalone LLMs on the knowledge-intensive “Humanity’s Last Exam” benchmark.
So, what happens when we integrate our current top LLM into an agentic framework, to work toward an open-DeepResearch?
A quick note: We’ll benchmark our results on the same GAIA challenge but keep in mind that this is a work in progress. DeepResearch is a massive achievement and its open reproduction will take time. In particular, full parity will require improved browser use and interaction like OpenAI’s Operator provides, i.e. going beyond the text-only web interaction we explore in this first step.
Let’s first understand the scope of the challenge: GAIA.
The GAIA benchmark
GAIA is arguably the most comprehensive benchmark for agents. Its questions are very difficult and hit on many challenges of LLM-based systems. Here is an example of a hard question:
Which of the fruits shown in the 2008 painting “Embroidery from Uzbekistan” were served as part of the October 1949 breakfast menu for the ocean liner that was later used as a floating prop for the film “The Last Voyage”? Give the items as a comma-separated list, ordering them in clockwise order based on their arrangement in the painting starting from the 12 o’clock position. Use the plural form of each fruit.
You can see this question involves several challenges:
- Answering in a constrained format,
- Using multimodal capabilities (to extract the fruits from the image),
- Gathering several pieces of information, some depending on others:
- Identifying the fruits in the picture
- Finding which ocean liner was used as a floating prop for “The Last Voyage”
- Finding the October 1949 breakfast menu for the above ocean liner
- Chaining together a problem-solving trajectory in the correct order.
Solving this requires both high-level planning abilities and rigorous execution, which are two areas where LLMs struggle when used alone.
So it’s an excellent test set for agent systems!
On GAIA’s public leaderboard, GPT-4 does not even reach 7% on the validation set when used without any agentic setup. At the other end of the spectrum, with Deep Research, OpenAI reached a score of 67.36% on the validation set, so an order of magnitude better! (Though we don’t know how they would actually fare on the private test set.)
Let’s see if we can do better with open source tools!
Building an open Deep Research
Using a CodeAgent
The first improvement over traditional AI agent systems we’ll tackle is to use a so-called “code agent”. As shown by Wang et al. (2024), letting the agent express its actions in code has several advantages, but most notably that code is specifically designed to express complex sequences of actions.
Consider this example given by Wang et al.:
This highlights several advantages of using code:
- Code actions are much more concise than JSON.
- Need to run 4 parallel streams of 5 consecutive actions? In JSON, you would need to generate 20 JSON blobs, each in their separate step; in Code it’s only 1 step (see the sketch after this list).
- On average, the paper shows that Code actions require 30% fewer steps than JSON, which amounts to an equivalent reduction in the tokens generated. Since LLM calls are often the dominant cost of agent systems, this means your agent system runs are ~30% cheaper.
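To illustrate the difference, here is a hedged sketch: the `search` and `final_answer` functions below are hypothetical stand-ins for tools an agent framework would expose to the model, not the actual smolagents internals. A JSON-based agent would emit one tool-call blob per action and pay one LLM call each time, while a code agent can chain several tool calls plus the glue logic in a single generated snippet.

```python
# Illustrative sketch only: 'search' and 'final_answer' are hypothetical stubs
# standing in for tools an agent framework would expose to the model.
def search(query: str) -> str:
    """Stub web-search tool; a real agent would hit a search API here."""
    return f"<top result for: {query}>"

def final_answer(answer: str) -> None:
    """Stub tool that terminates the run with the agent's answer."""
    print("Final answer:", answer)

# A JSON agent would emit one blob per action, each in its own LLM step, e.g.:
#   {"tool": "search", "arguments": {"query": "October 1949 breakfast menu ocean liner"}}
# A code agent can express the same plan, including loops and aggregation,
# in a single step:
ships = ["candidate liner A", "candidate liner B"]
results = [search(f"October 1949 breakfast menu for {ship}") for ship in ships]
final_answer("; ".join(results))  # arbitrary Python glue, no extra LLM call
```

In practice, the framework parses the snippet the LLM writes, executes it in a sandboxed Python interpreter, and feeds the outputs back as the observation for the next step.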