¹Independent, ²Anthropic
*Equal contribution
[Figure: example lab-play tasks ("Mine 16 Iron Ore per minute", "Smelt 16 Iron Plates per minute", "Make 16 Iron Gears per minute", "Extract 250 Petroleum Gas per minute", "Refine 16 Sulfur per minute", "Make 16 Plastic Bars per minute") and the open-play task "Build the largest possible factory". Caption: Claude Sonnet 3.5 builds factories.]
Abstract
Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations.
We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, which tests agents in long-term planning, program synthesis, and resource optimization.
FLE provides open-ended and exponentially scaling challenges – from basic automation to complex factories processing millions of resource units per second.
We provide two settings:
- Lab-play consisting of 24 structured tasks with fixed resources.
- Open-play with the unbounded task of building the largest factory from scratch on a procedurally generated map.
We demonstrate across both settings that models still lack strong spatial reasoning.
In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis.
In open-play, while LLMs discover automation strategies that improve growth (e.g., electric-powered drilling), they fail to achieve complex automation (e.g., electronic-circuit manufacturing).
Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex question-answer (QA) problems, saturating benchmarks in factual recollection, reasoning and code generation.
Benchmark saturation presents a critical challenge for the AI research community: how do we meaningfully evaluate and differentiate increasingly capable models?
We introduce the Factorio Learning Environment (FLE) …
21 Comments
p10jkle
Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.
Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?
leetbulb
Very cool project. Lovely diagrams.
zelias
Fantastic! Now I can sit back and watch the factory grow itself!
More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas for creative problem-solving.
artemonster
Diagonal belts are signs of evil
tux3
For the real frontier benchmark, at the edge of what humans will put up with, install Pyanodon's mod. The scale of it puts a real strain on your organizational skills. Overbuilding, underbuilding, or bad planning can all cause significant pain down the line as the factory risks becoming an unmanageable, tangled mess with no sane capacity for expansion. It's a real test of executive function and organization.
For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.
keeganpoppen
this is an absolutely fascinating project, wow! i am going to have to fire up Factorio again and try it out! the implications of what the experience of playing games is like in this new LLM era / world order are fascinating.
jharohit
Every time a paper like this comes out, I have one question: how do they control the game using the LLMs? How does the control-feedback loop work? What tools, software, and APIs do they use to do it on Mac or Windows?
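For what it's worth, the paper describes a REPL-style loop: the model writes code against a game API, the code is executed against a headless Factorio server, and the printed output (or error trace) is fed back as the next observation. Below is a minimal sketch of that kind of loop using the community factorio-rcon-py library; the llm_next_action helper is a hypothetical stand-in, and FLE's real API is richer than raw RCON:

```python
# Sketch of an LLM-in-the-loop controller for a headless Factorio server.
# Assumes `pip install factorio-rcon-py` and a server started with RCON
# enabled (--rcon-port 27015 --rcon-password secret). Illustrative only.
import factorio_rcon


def llm_next_action(history: list[tuple[str, str]]) -> str:
    """Hypothetical stand-in for the LLM call: given past (command,
    observation) pairs, return the next in-game command to execute."""
    return "/silent-command rcon.print(game.tick)"


client = factorio_rcon.RCONClient("127.0.0.1", 27015, "secret")

history: list[tuple[str, str]] = []
for _ in range(10):
    command = llm_next_action(history)
    # RCON returns whatever the command prints; errors come back as text
    # too, so the model can observe and react to its own mistakes.
    observation = client.send_command(command)
    history.append((command, observation))
```

Since the agent side is just a TCP client, the controlling process can run on Mac or Windows regardless of where the game server lives.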
moconnor
Incredible idea and execution, very interesting results. Genuinely: what a time to be alive!
mentalgear
> [LLMs] yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis
This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.
sturza
> 1. Coding skill predicts performance
> Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.
alexop
it's funny how video games are the hardest benchmark that humanity has for AI
loveparade
"put the right signals into my train network"
Not even humans can pass this benchmark.
WJW
Very cool and also pretty expected results tbh. Some thoughts:
Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often demanding investments that won't pay off until much later and that might even hamper initial development. Building a main bus vs. spaghetti belts is one of the obvious examples here.
Humans with a little bit of experience playing Factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short-term minded, it will probably build itself into a corner very quickly.
It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM (science per minute).
I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM, work backwards to how many of each resource you need, then recursively figure out the fastest way to set up the production for each subresource using standard optimization algorithms from OR (operations research). No LLMs needed.
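The "work backwards from the goal" half of that is indeed straightforward: required rates propagate through the recipe graph by simple recursion. A toy sketch (recipe numbers approximate, for illustration only):

```python
# Toy back-solver: given a target output rate, recursively compute the
# required rate of every intermediate product from a small recipe table.
from collections import defaultdict

# item -> (outputs per craft, {ingredient: amount per craft})
RECIPES = {
    "electronic_circuit": (1, {"iron_plate": 1, "copper_cable": 3}),
    "copper_cable": (2, {"copper_plate": 1}),
    "iron_plate": (1, {"iron_ore": 1}),
    "copper_plate": (1, {"copper_ore": 1}),
}


def required_rates(item: str, rate: float, out=None) -> dict[str, float]:
    """Items/sec needed of `item` and everything below it in the tree."""
    out = out if out is not None else defaultdict(float)
    out[item] += rate
    if item in RECIPES:
        outputs_per_craft, ingredients = RECIPES[item]
        crafts_per_sec = rate / outputs_per_craft
        for ingredient, amount in ingredients.items():
            required_rates(ingredient, crafts_per_sec * amount, out)
    return out


# 10 circuits/sec -> 10 iron plates/sec, 30 cables/sec, and so on down.
for item, rate in required_rates("electronic_circuit", 10.0).items():
    print(f"{item}: {rate:.1f}/s")
```

The rates are the easy part, though; placing and routing everything on the map is where the spatial reasoning the paper finds models lacking actually bites.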
myrmidon
Fascinating. Would have loved to see more pictures of the bigger factories, or is the zig-zag belt into plastic production currently the best result?
I think this very clearly illustrates a big weakness of current LLMs: humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't, yet.
I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.
Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?
sc68cal
The diagonal belts and diagonal pipes are especially cursed.
mritchie712
have you tried sonnet 3.7 yet? guessing these aren't cheap evals to run.
leaderboard: https://jackhopkins.github.io/factorio-learning-environment/…
kevmo314
Does it provide screenshots of the game state? I, too, would struggle to play the game effectively if I could not see it.
danso
Tangentially: been wondering when we’d ever see the breakthroughs in LLMs trickle down to making better adversarial game AIs. Haven’t tried Civ 7 b/c of its terrible reviews, but I’d happily buy in if there were AIs that were more human-like and varied in their scheming
cgannett
I wonder if anyone has done something similar with Dwarf Fortress
philipwhiuk
It's great to see that LLMs too, struggle with oil production.
spieswl
Fantastic idea.
It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea. I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.
I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control-intensive, and even pro players have a high chance of screwing it up when attempting it. It also doesn't seem to give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.
Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.
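For concreteness, one way such a task could be specified is sketched below; this schema is hypothetical, not FLE's actual task format:

```python
# Hypothetical task spec for a layout-optimization benchmark; FLE's real
# task definitions may look nothing like this.
from dataclasses import dataclass


@dataclass
class LayoutTask:
    bounds: tuple[int, int]    # width, height of the factory cell (tiles)
    inputs: dict[str, float]   # item -> available supply rate (items/sec)
    outputs: dict[str, float]  # item -> required output rate (items/sec)
    time_limit_ticks: int      # evaluation window (60 ticks = 1 second)


task = LayoutTask(
    bounds=(32, 32),
    inputs={"iron_plate": 15.0, "copper_plate": 15.0},
    outputs={"electronic_circuit": 10.0},
    time_limit_ticks=60 * 60 * 10,  # ten in-game minutes
)

# One possible score: achieved / required output rate, optionally
# penalized by footprint or entity count to reward compact designs.
```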