Why your MARL agents suck in the real world (and how to fix it)

Ever trained multi-agent AI in self-play? You end up with agents that play brilliantly with each other but are totally brittle. They overfit to their partner's weird quirks and fail the moment you pair them with a new agent (or a human).


A new post about Rational Policy Gradient (RPG) tackles this "self-sabotage."

The TL;DR:

  • Problem: Standard self-play trains each agent to be the best response to its partner's current policy. This leads to brittle, co-adapted strategies.
  • Solution (RPG): Train the agent to be a robust best response to its partner's future rational policy.
  • The Shift: It's like changing the goal from "How do I beat what you're doing now?" to "What's a good general strategy, assuming you'll also act rationally?" (see the toy sketch below).
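The post doesn't give pseudocode, so here's a toy sketch of the *idea* only, not RPG's actual algorithm. It's a 2-action coordination game, and the maximin-over-rational-conventions objective is my own illustrative stand-in for "best response to a rational partner":

```python
# Toy sketch of the idea, NOT the RPG algorithm itself (the post doesn't
# spell one out). Coordination game: payoff 1 if both players pick the
# same action, else 0.
import numpy as np

PAYOFF = np.eye(2)  # PAYOFF[i, j] = 1 iff the two actions match

def best_response(partner_policy):
    """Greedy best response to a fixed partner action distribution."""
    expected = PAYOFF @ partner_policy       # expected payoff of each action
    return np.eye(2)[np.argmax(expected)]    # one-hot greedy policy

# Standard self-play: lock onto the partner's current quirk.
partner = np.array([0.9, 0.1])               # partner slightly favors action 0
selfplay_policy = best_response(partner)     # -> always action 0 ("handshake")

# Rational-partner objective (illustrative stand-in): score against every
# policy that is itself a best response to some convention.
rational_partners = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

def robust_value(policy):
    """Worst-case expected payoff against any rational partner."""
    return min(policy @ PAYOFF @ p for p in rational_partners)

print(robust_value(selfplay_policy))          # 0.0 -- fails vs. the other convention
print(robust_value(np.array([0.5, 0.5])))     # 0.5 -- hedges across conventions
```

The point: the self-play agent bets everything on one convention, while the robust objective rewards strategies that hold up against any rational partner. RPG presumably gets there with policy gradients rather than a brute-force min, but that's beyond this toy.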

This method forces agents to learn robust, generalized policies. Tested on Hanabi (a notoriously hard co-op benchmark), it produced agents that are far more robust and can successfully cooperate with a diverse set of new partners.
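For context, "cooperates with new partners" is usually measured with cross-play: pair each trained agent with partners it never trained with and average the scores. A minimal sketch, where `play_episode` and the agent/partner objects are hypothetical stand-ins for your actual Hanabi (or other co-op) setup:

```python
# Minimal cross-play evaluation sketch. `play_episode`, the agents, and
# the partners are hypothetical stand-ins; wire in a real env rollout.
import random
import statistics

def cross_play_matrix(agents, partners, play_episode, episodes=100):
    """scores[i][j] = mean return of agents[i] paired with partners[j].

    A robust agent has a uniformly strong row; a co-adapted one only
    scores well against the partner it trained with.
    """
    return [[statistics.mean(play_episode(a, p) for _ in range(episodes))
             for p in partners]
            for a in agents]

# Dummy rollout so the sketch runs end to end; replace with a real episode.
dummy_episode = lambda agent, partner: random.random()
print(cross_play_matrix(["rpg", "selfplay"], ["bot_a", "bot_b"], dummy_episode))
```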

It stops agents from learning "secret handshakes" and forces them to learn the actual game instead. Pretty smart fix for a classic MARL headache.
