R1 Computer Use by mountainriver

10 Comments

  • Post Author
    mountainriver
    Posted February 6, 2025 at 8:02 pm

    Hey HN,

    We are working to apply the ideas of R1 to computer use. The primary struggle is creating reliable neural reward models since hard-verification rewards are not available at scale in GUI interactions.

    Our team is currently deep in the weeds of collecting reasoning annotation data for GUI interfaces to train a reliable reward model.

    We would love all thoughts, feedback, and collaborations!
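One standard way to train a reward model from human preference annotations like these is a pairwise (Bradley-Terry) objective, as used for RLHF reward models: score the annotator-preferred trajectory higher than the rejected one. A minimal sketch of just the loss in plain Python; the network that produces the scalar scores for GUI trajectories is assumed, not shown:

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the annotator-preferred
    trajectory higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Wider separation between chosen and rejected scores -> lower loss.
assert bradley_terry_loss(2.0, 0.0) < bradley_terry_loss(0.5, 0.0)
# Equal scores (indifference) give loss -log(0.5) = log(2).
assert abs(bradley_terry_loss(1.0, 1.0) - math.log(2.0)) < 1e-9
```

This is what makes annotation data usable where hard verification isn't: the signal is a relative judgment between two trajectories rather than a ground-truth check.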

  • Post Author
    llama-mini
    Posted February 6, 2025 at 8:46 pm

    It seems like a placeholder for now? No content, right?

  • Post Author
    falcor84
    Posted February 6, 2025 at 8:50 pm

    > @software{r1_computer_use,
    >   title = {R1-Computer-Use: Reasoning-First Computer Interaction},
    >   author = {Barker, Patrick},
    >   year = {2025},
    >   url = {https://github.com/agentsea/r1-computer-use},
    > }

    Sorry to be a party-pooper, but does it really make sense to add a citation when you don't have fully working code yet, let alone a paper about it?

  • Post Author
    fkyoureadthedoc
    Posted February 6, 2025 at 8:59 pm

    This is the type of post some VP at my company sees and starts telling people that R1 can use computer and then I have to be like "well actually" to 25 people.

    Computer use is pretty exciting stuff in general though, good luck

  • Post Author
    refulgentis
    Posted February 6, 2025 at 9:04 pm

    Free advice (though worth less than free, because A) it's unsolicited and B) it's saying "don't do it")

    TL;DR:

    – Turns out that if you do UXR, even if computer use is 100% successful in action execution, and there's no latency, people don't use it. (Interesting to me: the core demo was buying airline tickets, and so is OpenAI's. No one would defer to a computer on that, for humanist / design reasons.)

    – You're not going to be able to out-do the model companies at building models; they have too much funding.

    – Try writing GUI-based integration tests. Then imagine an LLM, miraculously, always chooses the right route. Does the UX look good?

    – Note that the reasoning models are worse at tool calling. It's very, very, VERY stark when you have Claude next to o1/4o. OpenAI also owns up to this in the o3-mini paper, though it's not under a blaring red headline or phrased that straightforwardly.

    – Why is that? You're fighting against the current when you try to teach a next-token predictor to throw a bunch of text out there in <think>, then generate perfectly correct JSON/Python/whatever given N tools.

    CLI, though….
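The failure mode being described — a long reasoning trace followed by a tool call that must be byte-perfect JSON — can be made concrete. A minimal sketch, assuming a hypothetical output format where the model emits a `<think>…</think>` block followed by a JSON tool call (the `name`/`arguments` shape is an assumption, loosely modeled on common tool-calling APIs):

```python
import json
import re

def extract_tool_call(output: str) -> dict:
    """Strip a <think>...</think> block, then parse the remainder as a
    JSON tool call. Raises ValueError/JSONDecodeError on malformed
    output -- exactly the failure long reasoning traces make likelier."""
    payload = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    call = json.loads(payload)
    if "name" not in call or "arguments" not in call:
        raise ValueError(f"missing tool-call fields: {call}")
    return call

out = (
    '<think>The user wants me to open the file, so I should click it.</think>\n'
    '{"name": "click", "arguments": {"x": 100, "y": 200}}'
)
call = extract_tool_call(out)
assert call["name"] == "click"
```

Every extra token of free-form reasoning is another chance for the model to drift out of the strict format the parser demands, which is one way to read the Claude-vs-o1 gap above.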

  • Post Author
    crazygringo
    Posted February 6, 2025 at 9:07 pm

    I can't wait for something like this to be built.

    People have tons of workflows that involve a lot of clicks and typing in response to data that are too difficult or one-off to automate with fragile macros.

    But if my computer can quickly realize that I'm deleting every odd-numbered page of a PDF, or renaming every file to add a prefix, or following each link on a website and saving an image… and then just instantly automate the next 100 times… that's going to be huge!

  • Post Author
    iiJDSii
    Posted February 6, 2025 at 9:24 pm

    What does your perception layer look like? Are you using raw screenshots? GUI snapshots? In some earlier experiments I found that vision is very difficult for these, and snapshots are incomplete.

  • Post Author
    mkagenius
    Posted February 6, 2025 at 9:38 pm

    Training a base model just for computer use seems like overkill, as a normal reasoning model like o3 for planning plus a vision model like gemini-flash is good enough[1] without being trained specifically for computer use.

    But if you still want to try this path, Google has made the ScreenQA dataset (RICO) available[2] along with bounding boxes.

    1. A framework to use local/hosted models for android use/control – https://github.com/BandarLabs/clickclickclick

    2. https://github.com/google-research-datasets/screen_qa
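The planner-plus-grounder split described above can be sketched as a simple loop: a reasoning model proposes the next UI action in natural language, and a vision model grounds it to screen coordinates. A minimal, stubbed sketch; the `plan`/`ground` callables stand in for real calls to hosted models (o3, gemini-flash, etc.) and are assumptions, not any particular API:

```python
from typing import Callable, Tuple

def run_step(
    instruction: str,
    screenshot: bytes,
    plan: Callable[[str, bytes], str],
    ground: Callable[[str, bytes], Tuple[int, int]],
) -> Tuple[str, Tuple[int, int]]:
    """One iteration of a two-model loop: the reasoning model proposes
    the next UI action as text, and the vision model grounds that text
    to (x, y) coordinates on the screenshot."""
    action = plan(instruction, screenshot)   # e.g. "tap the Settings icon"
    x, y = ground(action, screenshot)        # e.g. (412, 960)
    return action, (x, y)

# Stubbed models so the sketch runs; real code would call hosted models.
action, coords = run_step(
    "open settings",
    b"\x89PNG...",  # placeholder screenshot bytes
    plan=lambda inst, img: "tap the Settings icon",
    ground=lambda act, img: (412, 960),
)
assert coords == (412, 960)
```

The appeal of this decomposition is that neither model needs computer-use-specific training: the planner only reasons in text, and the grounder only localizes UI elements.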

  • Post Author
    emregucerr
    Posted February 6, 2025 at 9:39 pm

    I wonder how good R1 is at counting pixels from a screenshot. What enabled Claude and OpenAI's CUA to develop computer use was being able to give precise x-y coordinates for a click location.

    Also, how big of a gain is reasoning for computer use? I feel like reasoning unlocks a lot when there is a single complex question, but not so much when taking actions in a long-term plan.
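On the coordinate question: one common trick is to have the vision model emit clicks on a fixed normalized grid rather than raw pixels, then rescale to the actual screen. A minimal sketch, assuming a 0..1000 grid (the exact scale varies by model and is an assumption here):

```python
from typing import Tuple

def to_screen(
    norm_x: int, norm_y: int, width: int, height: int, scale: int = 1000
) -> Tuple[int, int]:
    """Map coordinates emitted on a fixed 0..scale grid to actual screen
    pixels, so the model never has to know the real resolution."""
    return (round(norm_x * width / scale), round(norm_y * height / scale))

# A click at (500, 500) on the normalized grid is the screen center.
assert to_screen(500, 500, 1920, 1080) == (960, 540)
```

This sidesteps "counting pixels" somewhat: the model localizes relative to the image it saw, and the harness handles the resolution-dependent arithmetic.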

  • Post Author
    3s
    Posted February 6, 2025 at 9:49 pm

    Are people concerned about the privacy implications of computer use at all? This is why I haven't been using Claude computer use personally. Somehow the idea of sending everything I do on my computer to a random third party seems creepy. There are a lot of applications of AI (Rewind comes to mind) that I simply cannot accept the idea of sharing my screen with.

© 2025 HackTech.info. All Rights Reserved.