Transformers Agent is an experimental API which is subject to change at any time. Results returned by the agents
can vary as the APIs or underlying models are prone to change.
Transformers Agent was introduced in Transformers version v4.29.0, building on the concept of tools and agents. You can play with it in this colab.
In short, it provides a natural language API on top of transformers: we define a set of curated tools and design an
agent to interpret natural language and to use these tools. It is extensible by design; we curated some relevant tools,
but we’ll show you how the system can be extended easily to use any tool developed by the community.
Let’s start with a few examples of what can be achieved with this new API. It is particularly powerful when it comes
to multimodal tasks, so let’s take it for a spin to generate images and read text out loud.
agent.run("Caption the following image", image=image)
Input | Output |
---|---|
*(image of a beaver swimming)* | A beaver is swimming in the water |
agent.run("Read the following text out loud", text=text)
Input | Output |
---|---|
A beaver is swimming in the water | *(audio of the text read out loud)* |
```py
agent.run(
    "In the following `document`, where will the TRRF Scientific Advisory Council Meeting take place?",
    document=document,
)
```
Input | Output |
---|---|
*(image of a scanned document)* | ballroom foyer |
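The `image`, `text`, and `document` objects in the examples above are ordinary Python variables that you prepare yourself. Here is a minimal sketch of how they might be set up; the URL and file name below are placeholders, not part of the official examples:

```py
import requests
from PIL import Image

# Placeholder URL pointing at an image of a beaver (hypothetical).
image = Image.open(requests.get("https://example.com/beaver.png", stream=True).raw)

# Plain string for the text-to-speech example.
text = "A beaver is swimming in the water"

# Scanned document for the document-question-answering example (placeholder local file).
document = Image.open("scanned_meeting_notes.png")
```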
Quickstart
Before being able to use agent.run(), you will need to instantiate an agent, which is a large language model (LLM).
We provide support for OpenAI models as well as open-source alternatives from BigCode and OpenAssistant. The OpenAI
models perform better (but require you to have an OpenAI API key, so they cannot be used for free); Hugging Face
provides free access to endpoints for the BigCode and OpenAssistant models.
To use OpenAI models, you instantiate an OpenAiAgent:
```py
from transformers import OpenAiAgent

agent = OpenAiAgent(model="text-davinci-003", api_key="<your_api_key>")
```
To use BigCode or OpenAssistant, start by logging in to have access to the Inference API:
```py
from huggingface_hub import login

login("<your_token>")
```
Then, instantiate the agent:
```py
from transformers import HfAgent

agent = HfAgent("https://api-inference.huggingface.co/models/bigcode/starcoder")
```
This is using the inference API that Hugging Face provides for free at the moment. If you have your own inference
endpoint for this model (or another one), you can replace the URL above with your own endpoint URL.
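For example, assuming you had deployed the StarCoder checkpoint (or another code model) behind your own inference endpoint, pointing the agent at it would look like this; the URL below is a placeholder:

```py
from transformers import HfAgent

# Placeholder URL — replace with the address of your own inference endpoint.
agent = HfAgent("https://my-endpoint.example.com")
```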
StarCoder and OpenAssistant are free to use and perform admirably well on simple tasks. However, the checkpoints
don’t hold up when handling more complex prompts. If you’re facing such an issue, we recommend trying out the OpenAI
model which, while sadly not open-source, performs better at the time of writing.
You’re now good to go! Let’s dive into the two APIs that you now have at your disposal.
Single execution (run)
Single execution uses the run() method of the agent:
agent.run("Draw me a picture of rivers and lakes.")
It automatically selects the tool (or tools) appropriate for the task you want to perform and runs them appropriately. It
can perform one or several tasks in the same instruction (though the more complex your instruction, the more likely
the agent is to fail).
agent.run("Draw me a picture of the sea then transform the picture to add an island")
Every run() operation is independent, so you can run it several times in a row with different tasks.
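For instance, the two calls below are completely independent; nothing from the first is remembered in the second (the `text` variable is just an illustrative placeholder):

```py
# Each call is a fresh request with no shared state.
agent.run("Draw me a picture of rivers and lakes.")
agent.run("Summarize the following `text` for me.", text=text)
```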
Note that your agent
is just a large language model, so small variations in your prompt might yield completely
different results. It’s important to explain as clearly as possible the task you want to perform. We go more in-depth
on how to write good prompts here.
If you’d like to keep a state across executions or to pass non-text objects to the agent, you can do so by specifying
variables that you would like the agent to use. For example, you could generate the first image of rivers and lakes,
and ask the model to update that picture to add an island by doing the following:
```py
picture = agent.run("Generate a picture of rivers and lakes.")
updated_picture = agent.run("Transform the image in `picture` to add an island to it.", picture=picture)
```
This can be helpful when the model is unable to understand your request and mixes tools. An example would be:
agent.run("Draw me the picture of a capybara swimming in the sea")
Here, the model could interpret the request in two ways:
- Have the `text-to-image` tool generate a capybara swimming in the sea
- Or, have the `text-to-image` tool generate a capybara, then use the `image-transformation` tool to have it swim in the sea
In case you would like to force the first scenario, you could do so by passing it the prompt as an argument:
agent.run("Draw me a picture of the `prompt`", prompt="a capybara swimming in the sea")
Chat-based execution (chat)
The agent also has a chat-based approach, using the chat() method:
agent.chat("Generate a picture of rivers and lakes")
agent.chat("Transform the picture so that there is a rock in there")
This is an interesting approach when you want to keep state across instructions. It's better for experimentation,
but it tends to handle single instructions much better than complex instructions (which the run() method is better
at handling).
This method can also take arguments if you would like to pass non-text types or specific prompts.
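For example, you can pass an image to chat() the same way you would with run(); here `picture` is assumed to hold an image, such as one returned by an earlier call:

```py
# Generate an image first, then pass it to the chat agent as a named variable.
picture = agent.run("Generate a picture of rivers and lakes.")
agent.chat("Transform the image in `picture` so that there is a boat sailing on the water", picture=picture)
```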
⚠️ Remote execution
For demonstration purposes and so that this can be used with all setups, we have created remote executors for several
of the default tools the agent has access to. These are created using
inference endpoints. To see how to set up remote executor tools yourself,
we recommend reading the custom tool guide.
In order to run with remote tools, specifying remote=True
to either run() or chat() is sufficient.
For example, the following command could be run on any device efficiently, without needing significant RAM or GPU:
agent.run("Draw me a picture of rivers and lakes", remote=True)
The same can be said for chat():
agent.chat("Draw me a picture of rivers and lakes", remote=True)
What’s happening here? What are tools, and what are agents?
Agents
The “agent” here is a large language model, and we’re prompting it so that it has access to a specific set of tools.
LLMs are pretty good at generating small samples of code, so this API takes advantage of that by prompting the
LLM to give a small sample of code performing a task with a set of tools. This prompt is then completed by the
task you give your agent and the description of the tools you give it. This way it gets access to the doc of the
tools you are using, especially their expected inputs and outputs.
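If you would rather inspect the code the agent generates than execute it, a sketch like the following should work, assuming the return_code flag behaves as in the version of Transformers this guide targets (returning the generated code instead of running it):

```py
# Ask the agent to return the generated code instead of executing it.
code = agent.run("Caption the following `image`", image=image, return_code=True)
print(code)
```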