Ever trained multi-agent AI in self-play? You end up with agents that are brilliant at beating each other, but totally brittle. They overfit to their partner's weird quirks and fail the moment you pair them with a new agent (or a human).
A new post about Rational Policy Gradient (RPG) tackles this "self-sabotage" problem.
The TL;DR:
- Problem: Standard self-play trains agents to be the best-response to their partner's current policy. This leads to brittle, co-adapted strategies.
- Solution (RPG): Train the agent to be a robust best-response to its partner's future rational policy.
- The Shift: It's like changing the goal from "How do I beat what you're doing now?" to "What's a good general strategy, assuming you'll also act rationally?" (a toy sketch of this shift follows below).
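To make that shift concrete, here is a minimal PyTorch sketch on a 2-action coordination game. This is not the paper's algorithm: the payoff matrix, the one-step "improved partner" proxy for rationality, and all function names are illustrative assumptions. It just contrasts a gradient taken against the partner's current policy (standard self-play) with one taken against a partner that is assumed to improve as well.

```python
import torch

# Shared payoff for a tiny 2-action coordination game: matching on either
# action pays off, matching on action 1 pays slightly more. (Toy example,
# not from the paper.)
PAYOFF = torch.tensor([[1.0, 0.0],
                       [0.0, 1.2]])

def expected_return(logits_a, logits_b):
    """Expected shared payoff under two independent softmax policies."""
    pa = torch.softmax(logits_a, dim=0)
    pb = torch.softmax(logits_b, dim=0)
    return pa @ PAYOFF @ pb

def improved_partner(logits_partner, logits_self, step=1.0):
    """One differentiable policy-gradient step for the partner -- a crude
    stand-in for 'assume the partner will also act rationally'."""
    ret = expected_return(logits_self, logits_partner)
    grad = torch.autograd.grad(ret, logits_partner, create_graph=True)[0]
    return logits_partner + step * grad

logits_a = torch.zeros(2, requires_grad=True)  # our agent
logits_b = torch.zeros(2, requires_grad=True)  # the partner

# Standard self-play: best-respond to the partner's *current* policy.
grad_selfplay = torch.autograd.grad(
    expected_return(logits_a, logits_b), logits_a)[0]

# RPG-flavoured update: best-respond to the partner *after* it improves,
# differentiating through the partner's own improvement step.
grad_rational = torch.autograd.grad(
    expected_return(logits_a, improved_partner(logits_b, logits_a)),
    logits_a)[0]

print("gradient vs. current partner: ", grad_selfplay)
print("gradient vs. rational partner:", grad_rational)
```

The difference between the two gradients is the whole point: the second update already accounts for where a sensible partner is headed rather than locking onto its current quirks, which is the intuition the post attributes to RPG.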
This pushes agents toward robust, generalized policies. Tested on Hanabi (a notoriously hard cooperative benchmark), RPG produced agents that are far more robust and can successfully cooperate with a diverse set of new partners.
Stops agents from learning "secret handshakes" and forces them to learn the actual game. Pretty smart fix for a classic MARL headache.
Reference: