The tuner module supports tuning agents within such multi-agent settings, allowing you to optimize a subset of agents while the other agents serve as the environment or as opponents.
This tutorial builds on the concepts introduced in Overview and Agent Reinforcement Learning. Make sure you are familiar with the core components (Task Dataset, Workflow Function, Judge Function) and the basic tuning workflow before proceeding.
We will use a simplified werewolf game as the running example throughout this tutorial. In this game, 7 players (2 werewolves, 3 villagers, 1 seer, 1 witch) interact over multiple rounds of discussion and voting. The goal is to train the werewolf players to improve their win rate.
Key Differences from Single-Agent Tuning
Tuning in a multi-agent system introduces several additional considerations compared to single-agent tuning:

| Aspect | Single-Agent Tuning | Multi-Agent Tuning |
|---|---|---|
| Workflow Function | One agent processes a task | Multiple agents interact with each other |
| Model Assignment | One trainable model | Trainable model + auxiliary models for other agents |
| Reward Design | Based on individual agent output | Based on collective outcome (e.g., game result) |
| Judge Function | Evaluates single response | Should be integrated into the workflow |
| Complexity | Simpler, shorter interactions | Longer episodes with multi-turn interactions |
The tuner only updates the weights of the trainable model; auxiliary models remain frozen during training.
Design the Workflow Function
In multi-agent tuning, the workflow function orchestrates the entire multi-agent interaction. It creates all agents, assigns models, runs the interaction, and returns the result. Compared to single-agent tuning, the workflow function accepts two additional parameters:

- `model`: The trainable model, assigned to the agents you want to tune.
- `auxiliary_models`: A dictionary of auxiliary models for the remaining agents.
In the werewolf example, `model` is assigned to the werewolf players, while all other roles (villagers, seer, witch) use the frozen `auxiliary_models["participant"]`. During tuning, only the werewolf model's weights are updated.
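A minimal sketch of such a workflow function is shown below. The `Player` class, the `run_game` helper, and the `"participant"` key are assumptions standing in for your own game logic and model configuration; only the `model` / `auxiliary_models` split follows the pattern described above. The `try`/`except` around the game loop reflects the error-handling advice later in this tutorial.

```python
from dataclasses import dataclass

# Hypothetical sketch: `Player` and `run_game` stand in for your own game
# code; the model-assignment pattern is the point here.

@dataclass
class Player:
    role: str
    model: object  # the chat model handle this player will call

def run_game(players):
    # Placeholder for the real discussion/voting loop; returns the winning side.
    return "werewolf"

def werewolf_workflow(task, model, auxiliary_models):
    """Run one episode; only werewolf players use the trainable model."""
    players = [
        Player(role, model if role == "werewolf" else auxiliary_models["participant"])
        for role in task["roles"]
    ]
    try:
        winner = run_game(players)
    except Exception:
        return 0.0  # a failed episode yields a neutral reward instead of crashing
    return 1.0 if winner == "werewolf" else 0.0
```

Swapping the `role == "werewolf"` condition for a different check is all it takes to move the training target to another role.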
Assign Models to Agents
The core pattern for multi-agent tuning is to selectively assign the trainable model to the agents you want to optimize, and auxiliary models to all other agents.

Tuning a Specific Role
In the werewolf example above, we train only the werewolf players by checking the role.

Tuning Multiple Roles Simultaneously
You can also train multiple roles at once using the same trainable model. For example, to train all good-side roles (villagers, seer, witch) instead of the werewolves, invert the role check.

Design Rewards for Multi-Agent Systems
Reward design is especially important in multi-agent settings because the outcome depends on the interactions between all agents, not just a single agent's response. In the werewolf game, the reward is naturally derived from the game outcome, i.e., whether the trainable team wins or loses. Since the workflow function computes this reward itself, pass judge_func=None to the tune() function.
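A minimal sketch of such an outcome-based reward, assuming the game result is available as a plain dict (the field names and the -1/0/+1 scale are illustrative choices, not part of the tuner's API):

```python
def outcome_reward(game_state, trainable_team="werewolf"):
    """Map a finished game to a scalar reward for the trainable team."""
    if game_state.get("aborted"):
        return 0.0  # crashed or timed-out episodes give no learning signal
    return 1.0 if game_state["winner"] == trainable_team else -1.0
```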
Handle Errors Gracefully
Multi-agent interactions are inherently more complex and error-prone than single-agent tasks. It is important to handle exceptions in the workflow function to prevent training failures.

Configuration & Tuning
After implementing the workflow function, configure the tuning process. The key difference from single-agent tuning is the addition of `auxiliary_models` in the `tune()` call.
- `auxiliary_models`: A dictionary mapping model names to `TunerModelConfig`. The keys must match those used in the `auxiliary_models` parameter of your workflow function.
- `group_size`: In multi-agent settings, each task episode involves multiple agents interacting over many turns, making each rollout more expensive. Consider balancing group size against available compute resources.
- `model.max_model_len`: Multi-agent interactions typically produce longer conversation histories. Set a sufficiently large `max_model_len` to accommodate the full interaction.
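As a rough configuration sketch (only `auxiliary_models`, `judge_func`, and `group_size` are named in the text above; the remaining argument names are assumptions about the API):

```python
# Hypothetical sketch of the tune() call -- check the tuner's API reference
# for the exact signature and TunerModelConfig fields.
tune(
    workflow_func=werewolf_workflow,
    judge_func=None,                    # reward is computed inside the workflow
    auxiliary_models={
        # key must match the lookup in the workflow function
        "participant": TunerModelConfig(...),
    },
    group_size=4,                       # multi-agent rollouts are expensive
)
```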
Switching Training Target via workflow_args
In the werewolf game, you may want to train werewolves in one run and good guys in another, without changing the code. The tuner supports passing extra arguments to the workflow function through the task’s workflow_args field.
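Inside the workflow function, the training target can then be selected from these arguments. The sketch below assumes a hypothetical `trainable_side` key; use whatever key you define in your configuration.

```python
# Hypothetical sketch: choose which roles get the trainable model based on a
# `trainable_side` entry in the task's workflow_args.
GOOD_ROLES = {"villager", "seer", "witch"}

def trainable_roles(workflow_args):
    side = workflow_args.get("trainable_side", "werewolf")
    return {"werewolf"} if side == "werewolf" else GOOD_ROLES

def pick_model(role, workflow_args, model, auxiliary_models):
    """Return the trainable model for target roles, a frozen model otherwise."""
    if role in trainable_roles(workflow_args):
        return model
    return auxiliary_models["participant"]
```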
Set `workflow_args` in your YAML configuration file:
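For example (the task-file schema and the `trainable_side` key are illustrative assumptions; only the `workflow_args` field is what the tuner passes through):

```yaml
tasks:
  - name: werewolf_game
    workflow_args:
      trainable_side: werewolf   # switch to "good" to train villagers/seer/witch
```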
Complete Example
Werewolf Game Training Example
A full end-to-end example training werewolf agents in a 7-player social deduction game — achieving ~85% win rate (up from ~50%) with configurable training targets.
Best Practices
Start simple and scale up
Begin with a small number of agents and short interaction episodes. Scale up once you confirm the setup works correctly.
Validate locally first
Run your workflow function locally with a few test tasks before launching the full tuning process to catch bugs early.
Use a stronger auxiliary model
Using a stronger model for auxiliary agents provides a more challenging and stable environment for the trainable agents, which generally leads to better training outcomes.
Monitor with logging
Add the logger parameter to your workflow function (see Agent Reinforcement Learning — Runtime Monitoring) to debug multi-agent interactions during tuning.

Design clear reward signals
In multi-agent settings, sparse rewards (e.g., only win/loss at the end) can slow training. Consider adding intermediate reward signals when possible.
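For instance, a shaped reward might combine the final outcome with small per-event bonuses; the event names and weights below are illustrative assumptions, not part of the tuner:

```python
def shaped_reward(won, events):
    """Final win/loss signal plus small intermediate bonuses."""
    bonuses = {"survived_round": 0.05, "vote_succeeded": 0.10}
    reward = 1.0 if won else -1.0
    reward += sum(bonuses.get(event, 0.0) for event in events)
    return reward
```

Keep intermediate bonuses small relative to the terminal signal so they guide, rather than dominate, the final objective.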
Handle long episodes
Multi-agent interactions can produce very long conversation histories. Set max_model_len appropriately and consider adding timeouts in your workflow to avoid excessively long episodes.
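One way to bound episode length, assuming your game loop is an async coroutine (`run_game` here is a stand-in for your own code), is a wall-clock timeout:

```python
import asyncio

# Hypothetical sketch: cap episode duration so a stuck multi-agent loop
# cannot stall training.
async def run_with_timeout(run_game, players, timeout_s=600.0):
    try:
        return await asyncio.wait_for(run_game(players), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # treat as an aborted episode -> give a neutral reward
```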