
Show HN: Factorio Learning Environment – Agents Build Factories by noddybear

21 Comments

  • Post Author
    p10jkle
    Posted March 11, 2025 at 12:15 pm

    Wow, fascinating. I wonder if in a few years every in-game opponent will just be an LLM with access to a game-controlling API like the one you've created.

    Did you find there are particular types of tasks that the models struggle with? Or does difficulty mostly just scale with the number of items they need to place?

  • Post Author
    leetbulb
    Posted March 11, 2025 at 12:32 pm

    Very cool project. Lovely diagrams.

  • Post Author
    zelias
    Posted March 11, 2025 at 12:38 pm

    Fantastic! Now I can sit back and watch the factory grow itself!

    More seriously, I think this is a great "next step" in the evolution of benchmarks. Contemporary benchmarks are really just standardized tests, but the Factorio problem set presents an unbounded canvas of creativity for solutioning.

  • Post Author
    artemonster
    Posted March 11, 2025 at 12:46 pm

    Diagonal belts are signs of evil

  • Post Author
    tux3
    Posted March 11, 2025 at 12:47 pm

    For the real frontier benchmark, at the edge of what humans will put up with, install Pyanodon's mod. The scale of it puts a real strain on your organizational skills. Overbuilding, underbuilding, or bad planning can all cause significant pain down the line as the factory risks becoming an unmanageable, tangled mess with no sane capacity for expansion. It's a real test of executive function and organization.

    For something humans will definitely not put up with, install PyBlock in Hard Mode. I suspect that benchmark will not fall anytime soon. It is borderline impossible without superhuman patience.

  • Post Author
    keeganpoppen
    Posted March 11, 2025 at 12:48 pm

    this is an absolutely fascinating project – wow! i am going to have to fire up Factorio again and try it out! the implications of what the experience of playing games is like in this new LLM era / world order are fascinating.

  • Post Author
    jharohit
    Posted March 11, 2025 at 12:51 pm

    Every time a paper like this comes out, I always have one question: how do they control the game using the LLMs? How does the control-feedback loop work? What tools, software, and APIs do they use to do it on Mac or Windows?

  • Post Author
    moconnor
    Posted March 11, 2025 at 12:52 pm

    Incredible idea and execution, very interesting results. Genuinely: what a time to be alive!

  • Post Author
    mentalgear
    Posted March 11, 2025 at 12:57 pm

    > [LLMs] yet are unable to operate effectively in constrained environments, reflecting limitations in error analysis

    This reflects my experience with gen-LLM coding, where LLMs keep trying to do the same thing in a loop.

  • Post Author
    sturza
    Posted March 11, 2025 at 1:00 pm

    > 1. Coding skill predicts performance

    > Models with stronger coding abilities (Claude 3.5-Sonnet, GPT-4o) achieved higher Production Scores and completed more lab tasks. Claude outperformed others with a PS of 293,206 and 28 milestones, progressing beyond early-game resource extraction.

  • Post Author
    alexop
    Posted March 11, 2025 at 1:01 pm

    it's funny how video games are the hardest benchmark that humanity has for AI

  • Post Author
    loveparade
    Posted March 11, 2025 at 1:04 pm

    "put the right signals into my train network"

    Not even humans can pass this benchmark.

  • Post Author
    WJW
    Posted March 11, 2025 at 1:12 pm

    Very cool and also pretty expected results tbh. Some thoughts:

    Factorio is a game that requires SIGNIFICANT amounts of thinking ahead, often requiring investments into things that won't pay off until much later and which might even significantly hamper initial development. Building a main bus vs spaghetti belts is one of the obvious examples here.

    Humans with a little bit of experience playing Factorio know that while building 1 item/s of some new resource is good, the game is about eventually building thousands of the new item. Until the LLM learns not to be short-term minded, it will probably build itself into a corner very quickly.

    It is kind of amazing that these models manage to figure out a strategy at all, considering the game is not in their training set. That said, the current research goals are not very good IMO. Building the largest possible base has the predictable result of the AI building a humongous belt loop covering much of the map. A much better target would be the "standard" goal of SPM.

    I think 99% of Factorio could be "solved" with GOFAI algorithms from the 80s and enough processing power. Set up a goal like 10k SPM, work backwards to how many of each resource you need, then recursively figure out the fastest way to set up production for each subresource using standard optimization algorithms from OR. No LLMs needed.
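The "work backwards from a target" step in the comment above can be sketched as a recursive walk over a recipe graph. This is only an illustration: the `RECIPES` table, the item names, and the `required_rates` helper are hypothetical placeholders, not real Factorio data and not part of FLE. It covers only the demand calculation, not the layout or optimization part.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical recipe table: item -> (items produced per craft, {ingredient: count per craft}).
# The numbers are illustrative placeholders, NOT real Factorio recipes.
RECIPES = {
    "science_pack": (1, {"gear": 1, "copper_plate": 1}),
    "gear": (1, {"iron_plate": 2}),
    "iron_plate": (1, {"iron_ore": 1}),
    "copper_plate": (1, {"copper_ore": 1}),
}


def required_rates(item, rate, totals=None):
    """Accumulate the production rate needed for `item` and everything it depends on."""
    if totals is None:
        totals = defaultdict(Fraction)
    totals[item] += rate
    if item in RECIPES:  # items without a recipe are treated as raw resources
        produced_per_craft, ingredients = RECIPES[item]
        crafts = Fraction(rate) / produced_per_craft
        for ingredient, count in ingredients.items():
            required_rates(ingredient, crafts * count, totals)
    return totals


# Target: 10,000 science packs per minute (the "10k SPM" goal from the comment).
totals = required_rates("science_pack", Fraction(10_000))
for name, per_minute in sorted(totals.items()):
    print(f"{name}: {per_minute} per minute")
```

Under these made-up recipes, every gear needs two iron plates, so the iron demand comes out at twice the science-pack target. A real solver would also need machine speeds, belt throughput, and spatial layout, which this sketch ignores.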

  • Post Author
    myrmidon
    Posted March 11, 2025 at 1:16 pm

    Fascinating. Would have loved to see more pictures of the bigger factories – or is the zig-zag belt into plastic production currently the best result?

    I think this very clearly illustrates a big weakness of current LLMs – humans might struggle just as much at first, but are able to specialize and adapt to a task, while LLMs can't – yet.

    I'm expecting even greater improvements from figuring out online learning/adaptation than what we got from chain-of-thought approaches.

    Do you think the "API" to interact with the game is a big obstacle, compared to a human interacting with the game via monitor? Did anyone try to interact with the game via this API, and how does human effort measure up to the AIs?

  • Post Author
    sc68cal
    Posted March 11, 2025 at 1:26 pm

    The diagonal belts and diagonal pipes are especially cursed.

  • Post Author
    mritchie712
    Posted March 11, 2025 at 1:29 pm

    have you tried sonnet 3.7 yet? guessing these aren't cheap evals to run.

    leaderboard: https://jackhopkins.github.io/factorio-learning-environment/…

  • Post Author
    kevmo314
    Posted March 11, 2025 at 1:39 pm

    Does it provide screenshots of the game state? I, too, would struggle to play the game effectively if I could not see it.

  • Post Author
    danso
    Posted March 11, 2025 at 1:41 pm

    Tangentially: been wondering when we’d ever see the breakthroughs in LLMs trickle down to making better adversarial game AIs. Haven’t tried Civ 7 b/c of its terrible reviews, but I’d happily buy in if there were AIs that were more human-like and varied in their scheming

  • Post Author
    cgannett
    Posted March 11, 2025 at 1:55 pm

    I wonder if anyone has done something similar with Dwarf Fortress

  • Post Author
    philipwhiuk
    Posted March 11, 2025 at 1:55 pm

    It's great to see that LLMs, too, struggle with oil production.

  • Post Author
    spieswl
    Posted March 11, 2025 at 2:01 pm

    Fantastic idea.

    It seems like there are a lot of interesting experiments to be had here. Giving the lab-play scenarios a time-related component seems like a good idea. I assume most Factorio players who keep biters on treat them as a combined temporal-spatial constraint, so you have a sort-of proxy comparison to a real game situation when you put the agents on a timer.

    I like the way that the framework design is testing different things than micromanagement proficiency, such as what we have seen in DOTA 2 or StarCraft 2 experiments. Notably, severe worker micromanagement (in the case of the latter game) becomes a way to squeak out extra minerals when you have infinite APM available. This is an interesting learned behavior in a narrow context, but that tactic is really control-intensive, and even pro players have a high chance of screwing it up when they attempt it. It also doesn't seem to give additional insight into an agent's longer-term planning, execution, and analytical performance. FLE seems way more interesting as a higher-level "thinking" evaluation framework, with all that in mind.

    Any plans for layout optimization benchmarks? As in, start with a given factory cell with X inputs and Y outputs, and optimize its performance.


© 2025 HackTech.info. All Rights Reserved.
