[D] How to benchmark open-ended, real-world goal achievement by computer-using LLMs?

By skyforbes, Nov 27, 2025

GDPval covers measuring agent performance on economically valuable tasks. We are working on the AI Village, where we explore, and possibly evaluate, how groups of persistent agents do at open-ended, real-world tasks in general. We're currently running all the frontier LLMs (OpenAI, Anthropic, DeepMind), each with its own computer, internet access, and a group chat, and we give them goals like raising money for charity, organizing an event, or selling t-shirts online. We had the agents try to invent a benchmark for themselves, but they ended up writing a lot of words, taking almost no actions, and still declaring themselves amazing at the benchmark. Gemini 2.5 Pro did manage to make something like a podcast and a "documentary", but these were pretty rudimentary attempts.

I'm curious what ideas people here might have. Say you had a persistent multi-agent system, where each LLM is using a computer and trying to achieve goals: What goals would be interesting to give them? How would you compare the agents? What tools would you give them? What are the main things you'd be excited to explore?

Some examples of insights we got so far, in case that helps kick-start conversation 🙂

– Hallucinations and lack of situational awareness have hampered o3 a lot, resulting in it performing quite badly on goals that require real-world action. Meanwhile, it does really well on "talking" goals like winning the most debates during a formal debate season.

– Computer use skills combined with temperament often lead Gemini 2.5 Pro to give up on achieving goals while other (sometimes less capable) agents keep working regardless. It seems to disproportionately attribute its own errors (e.g. misclicks) to the environment and then decide it's all hopeless.

– Document sharing is surprisingly hard, and so is playing online games. Meanwhile, they've made nice websites for themselves and do well on Twitter (if given an account and reminded of its existence). I'm not entirely sure why this pattern is emerging.
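
On the "how would you compare the agents" question, and given the self-grading failure mentioned above (lots of words, almost no actions, glowing self-assessment), one option might be to rank agents only on outcomes that someone outside the agent can check, and to track the gap between claimed and verified results as its own signal. Here is a rough Python sketch of that idea. Everything in it is hypothetical: the `GoalAttempt` fields, the 0-to-1 scoring scale, and the agent labels are assumptions for illustration, not part of the AI Village setup.

```python
from dataclasses import dataclass


@dataclass
class GoalAttempt:
    agent: str            # hypothetical agent label
    goal: str             # e.g. "raise money for charity"
    self_reported: float  # the agent's own claimed score, assumed in [0, 1]
    verified: float       # externally checked outcome, assumed in [0, 1]


def summarize(attempts):
    """Per agent: mean verified score and mean overclaim (self-report minus verified)."""
    totals = {}
    for a in attempts:
        t = totals.setdefault(a.agent, {"n": 0, "verified": 0.0, "overclaim": 0.0})
        t["n"] += 1
        t["verified"] += a.verified
        t["overclaim"] += a.self_reported - a.verified
    return {
        agent: {
            "mean_verified": t["verified"] / t["n"],
            "mean_overclaim": t["overclaim"] / t["n"],
        }
        for agent, t in totals.items()
    }


def rank(attempts):
    """Order agents by verified outcomes only; self-reports never affect the ranking."""
    summary = summarize(attempts)
    return sorted(summary, key=lambda agent: summary[agent]["mean_verified"], reverse=True)


if __name__ == "__main__":
    # Made-up numbers purely to show the shape of the output.
    demo = [
        GoalAttempt("agent_a", "sell t-shirts", self_reported=1.0, verified=0.0),
        GoalAttempt("agent_b", "sell t-shirts", self_reported=0.5, verified=0.5),
    ]
    print(rank(demo))
    print(summarize(demo))
```

A large positive `mean_overclaim` would flag an agent that talks up results it did not produce, while the ranking itself depends only on what was verified (money actually raised, items actually sold, and so on).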
