¹Independent, ²Anthropic
*Equal contribution
[Figure: example lab-play tasks ("Mine 16 Iron Ore per minute", "Smelt 16 Iron Plates per minute", "Make 16 Iron Gears per minute", "Extract 250 Petroleum Gas per minute", "Refine 16 Sulfur per minute", "Make 16 Plastic Bars per minute") and the open-play task "Build the largest possible factory". Caption: Claude Sonnet 3.5 builds factories.]
Abstract
Large Language Models (LLMs) are rapidly saturating existing benchmarks, necessitating new open-ended evaluations.
We introduce the Factorio Learning Environment (FLE), based on the game of Factorio, which tests agents in long-term planning, program synthesis, and resource optimization.
FLE provides open-ended and exponentially scaling challenges – from basic automation to complex factories processing millions of resource units per second.
We provide two settings:
- Lab-play consisting of 24 structured tasks with fixed resources.
- Open-play with the unbounded task of building the largest factory from scratch on a procedurally generated map.
We demonstrate across both settings that models still lack strong spatial reasoning.
In lab-play, we find that LLMs exhibit promising short-horizon skills, yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis.
In open-play, while LLMs discover automation strategies that improve growth (e.g., electric-powered drilling), they fail to achieve complex automation (e.g., electronic-circuit manufacturing).
Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities at solving complex question-answer (QA) problems, saturating benchmarks in factual recollection, reasoning and code generation.
Benchmark saturation presents a critical challenge for the AI research community: how do we meaningfully evaluate and differentiate increasingly capable models?
We introduce the Factorio Learning Environment (FLE) …
21 Comments
p10jkle
Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.
Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?
leetbulb
Very cool project. Lovely diagrams.
zelias
Fantastic! Now I can sit back and watch the factory grow itself!
More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas for creative problem-solving.
artemonster
Diagonal belts are signs of evil
tux3
For the real frontier benchmark, at the edge of what humans will put up with, install Pyanodon's mod. The scale of it puts a real strain on your organizational skills. Overbuilding, underbuilding, or bad planning can all cause significant pain down the line as the factory risks becoming an unmanageable, tangled mess with no sane capacity for expansion. It's a real test of executive function and organization.
For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.
keeganpoppen
this is an absolutely fascinating project, wow! i am going to have to fire up Factorio again and try it out! the implications of what the experience of playing games is like in this new LLM era / world order are fascinating.
jharohit
Every time a paper like this comes out, I have one question: how do they control the game using the LLMs? How does the control-feedback loop work? What tools, software, and APIs do they use to do it on Mac or Windows?
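For what it's worth, the paper describes a REPL-style loop: the model writes code against a game API, the code is executed against a headless Factorio server, and the printed output (or error trace) is fed back as the next observation. Below is a minimal sketch of that kind of loop using the community factorio-rcon-py library; the llm_next_action helper is a hypothetical stand-in, and FLE's real API is richer than raw RCON:

```python
# Sketch of an LLM-in-the-loop controller for a headless Factorio server.
# Assumes `pip install factorio-rcon-py` and a server started with RCON
# enabled (--rcon-port 27015 --rcon-password secret). Illustrative only.
import factorio_rcon


def llm_next_action(history: list[tuple[str, str]]) -> str:
    """Hypothetical stand-in for the LLM call: given past (command,
    observation) pairs, return the next in-game command to execute."""
    return "/silent-command rcon.print(game.tick)"


client = factorio_rcon.RCONClient("127.0.0.1", 27015, "secret")

history: list[tuple[str, str]] = []
for _ in range(10):
    command = llm_next_action(history)
    # RCON returns whatever the command prints; errors come back as text
    # too, so the model can observe and react to its own mistakes.
    observation = client.send_command(command)
    history.append((command, observation))
```

Since the agent side is just a TCP client, the controlling process can run on Mac or Windows regardless of where the game server lives.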
moconnor
Incredible idea and execution, very interesting results. Genuinely: what a time to be alive!
mentalgear
> [LLMs] yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis
This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.
sturza
> 1. Coding skill predicts performance
> Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.
alexop
it's funny how video games are the hardest benchmark that humanity has for AI
loveparade
"put the right signals into my train network"
Not even humans can pass this benchmark.
WJW
Very cool and also pretty expected results tbh. Some thoughts:
Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often demanding investments that won't pay off until much later and that might even hamper initial development. Building a main bus vs. spaghetti belts is one of the obvious examples here.
Humans with a little bit of experience playing Factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short-term minded, it will probably build itself into a corner very quickly.
It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM (science per minute).
I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM, work backwards to how many of each resource you need, then recursively figure out the fastest way to set up the production for each subresource using standard optimization algorithms from OR (operations research). No LLMs needed.
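The "work backwards from the goal" half of that is indeed straightforward: required rates propagate through the recipe graph by simple recursion. A toy sketch (recipe numbers approximate, for illustration only):

```python
# Toy back-solver: given a target output rate, recursively compute the
# required rate of every intermediate product from a small recipe table.
from collections import defaultdict

# item -> (outputs per craft, {ingredient: amount per craft})
RECIPES = {
    "electronic_circuit": (1, {"iron_plate": 1, "copper_cable": 3}),
    "copper_cable": (2, {"copper_plate": 1}),
    "iron_plate": (1, {"iron_ore": 1}),
    "copper_plate": (1, {"copper_ore": 1}),
}


def required_rates(item: str, rate: float, out=None) -> dict[str, float]:
    """Items/sec needed of `item` and everything below it in the tree."""
    out = out if out is not None else defaultdict(float)
    out[item] += rate
    if item in RECIPES:
        outputs_per_craft, ingredients = RECIPES[item]
        crafts_per_sec = rate / outputs_per_craft
        for ingredient, amount in ingredients.items():
            required_rates(ingredient, crafts_per_sec * amount, out)
    return out


# 10 circuits/sec -> 10 iron plates/sec, 30 cables/sec, and so on down.
for item, rate in required_rates("electronic_circuit", 10.0).items():
    print(f"{item}: {rate:.1f}/s")
```

The rates are the easy part, though; placing and routing everything on the map is where the spatial reasoning the paper finds models lacking actually bites.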
myrmidon
Fascinating. Would have loved to see more pictures of the bigger factories, or is the zig-zag belt into plastic production currently the best result?
I think this very clearly illustrates a big weakness of current LLMs: humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't, yet.
I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.
Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?
sc68cal
The diagonal belts and diagonal pipes are especially cursed.
mritchie712
have you tried sonnet 3.7 yet? guessing these aren't cheap evals to run.
leaderboard: https://jackhopkins.github.io/factorio-learning-environment/…
kevmo314
Does it provide screenshots of the game state? I, too, would struggle to play the game effectively if I could not see it.
danso
Tangentially: been wondering when we’d ever see the breakthroughs in LLMs trickle down to making better adversarial game AIs. Haven’t tried Civ 7 b/c of its terrible reviews, but I’d happily buy in if there were AIs that were more human-like and varied in their scheming
cgannett
I wonder if anyone has done something similar with Dwarf Fortress
philipwhiuk
It's great to see that LLMs too, struggle with oil production.
spieswl
Fantastic idea.
It seems like there are a lot of interesting experiments to be had here. The lab-play scenarios having a time-related component seems like a good idea. I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.
I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control-intensive, and even pro players have a high chance of screwing it up when attempting it. It also doesn't seem to give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.
Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.
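For concreteness, one way such a task could be specified is sketched below; this schema is hypothetical, not FLE's actual task format:

```python
# Hypothetical task spec for a layout-optimization benchmark; FLE's real
# task definitions may look nothing like this.
from dataclasses import dataclass


@dataclass
class LayoutTask:
    bounds: tuple[int, int]    # width, height of the factory cell (tiles)
    inputs: dict[str, float]   # item -> available supply rate (items/sec)
    outputs: dict[str, float]  # item -> required output rate (items/sec)
    time_limit_ticks: int      # evaluation window (60 ticks = 1 second)


task = LayoutTask(
    bounds=(32, 32),
    inputs={"iron_plate": 15.0, "copper_plate": 15.0},
    outputs={"electronic_circuit": 10.0},
    time_limit_ticks=60 * 60 * 10,  # ten in-game minutes
)

# One possible score: achieved / required output rate, optionally
# penalized by footprint or entity count to reward compact designs.
```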