The tuner module supports tuning agents within such multi-agent settings, allowing you to optimize a subset of agents while the other agents serve as the environment or as opponents.
This tutorial builds on the concepts introduced in Overview and Agent Reinforcement Learning. Make sure you are familiar with the core components (Task Dataset, Workflow Function, Judge Function) and the basic tuning workflow before proceeding.
We will use a simplified werewolf game as the running example throughout this tutorial. In this game, 7 players (2 werewolves, 3 villagers, 1 seer, 1 witch) interact over multiple rounds of discussion and voting. The goal is to train the werewolf players to improve their win rate.
Key Differences from Single-Agent Tuning
Tuning in a multi-agent system introduces several additional considerations compared to single-agent tuning:

| Aspect | Single-Agent Tuning | Multi-Agent Tuning |
|---|---|---|
| Workflow Function | One agent processes a task | Multiple agents interact with each other |
| Model Assignment | One trainable model | Trainable model + auxiliary models for other agents |
| Reward Design | Based on individual agent output | Based on collective outcome (e.g., game result) |
| Judge Function | Evaluates single response | Should be integrated into the workflow |
| Complexity | Simpler, shorter interactions | Longer episodes with multi-turn interactions |
The tuner only updates the weights of the trainable model; auxiliary models remain frozen during training.
Design the Workflow Function
In multi-agent tuning, the workflow function orchestrates the entire multi-agent interaction. It creates all agents, assigns models, runs the interaction, and returns the result. Compared to single-agent tuning, the workflow function accepts two additional parameters:

- `model`: The trainable model, assigned to the agents you want to tune.
- `auxiliary_models`: A dictionary of auxiliary models for the remaining agents.
In the werewolf example, `model` is assigned to the werewolf players, while all other roles (villagers, seer, witch) use the frozen `auxiliary_models["participant"]`. During tuning, only the werewolf model's weights are updated.
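A minimal sketch of such a workflow function is shown below. The `Player` class, the `run_game` helper, and the `"participant"` key are assumptions standing in for your own game logic and model configuration; only the `model` / `auxiliary_models` split follows the pattern described above. The `try`/`except` around the game loop reflects the error-handling advice later in this tutorial.

```python
from dataclasses import dataclass

# Hypothetical sketch: `Player` and `run_game` stand in for your own game
# code; the model-assignment pattern is the point here.

@dataclass
class Player:
    role: str
    model: object  # the chat model handle this player will call

def run_game(players):
    # Placeholder for the real discussion/voting loop; returns the winning side.
    return "werewolf"

def werewolf_workflow(task, model, auxiliary_models):
    """Run one episode; only werewolf players use the trainable model."""
    players = [
        Player(role, model if role == "werewolf" else auxiliary_models["participant"])
        for role in task["roles"]
    ]
    try:
        winner = run_game(players)
    except Exception:
        return 0.0  # a failed episode yields a neutral reward instead of crashing
    return 1.0 if winner == "werewolf" else 0.0
```

Swapping the `role == "werewolf"` condition for a different check is all it takes to move the training target to another role.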
Assign Models to Agents
The core pattern for multi-agent tuning is to selectively assign the trainable model to the agents you want to optimize, and auxiliary models to all other agents.

Tuning a Specific Role
In the werewolf example above, we train only the werewolf players by checking the role.

Tuning Multiple Roles Simultaneously
You can also train multiple roles at once using the same trainable model. For example, to train all good-side roles (villagers, seer, witch) instead of the werewolves, invert the role check.

Design Rewards for Multi-Agent Systems
Reward design is especially important in multi-agent settings because the outcome depends on the interactions between all agents, not just a single agent's response. In the werewolf game, the reward is naturally derived from the game outcome, i.e., whether the trainable team wins or loses. Since the workflow function computes this reward itself, pass judge_func=None to the tune() function.
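A minimal sketch of such an outcome-based reward, assuming the game result is available as a plain dict (the field names and the -1/0/+1 scale are illustrative choices, not part of the tuner's API):

```python
def outcome_reward(game_state, trainable_team="werewolf"):
    """Map a finished game to a scalar reward for the trainable team."""
    if game_state.get("aborted"):
        return 0.0  # crashed or timed-out episodes give no learning signal
    return 1.0 if game_state["winner"] == trainable_team else -1.0
```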
Handle Errors Gracefully
Multi-agent interactions are inherently more complex and error-prone than single-agent tasks. It is important to handle exceptions in the workflow function to prevent training failures.

Configuration & Tuning
After implementing the workflow function, configure the tuning process. The key difference from single-agent tuning is the addition of `auxiliary_models` in the `tune()` call.
- `auxiliary_models`: A dictionary mapping model names to `TunerModelConfig`. The keys must match those used in the `auxiliary_models` parameter of your workflow function.
- `group_size`: In multi-agent settings, each task episode involves multiple agents interacting over many turns, making each rollout more expensive. Consider balancing group size against available compute resources.
- `model.max_model_len`: Multi-agent interactions typically produce longer conversation histories. Set a sufficiently large `max_model_len` to accommodate the full interaction.
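As a rough configuration sketch (only `auxiliary_models`, `judge_func`, and `group_size` are named in the text above; the remaining argument names are assumptions about the API):

```python
# Hypothetical sketch of the tune() call -- check the tuner's API reference
# for the exact signature and TunerModelConfig fields.
tune(
    workflow_func=werewolf_workflow,
    judge_func=None,                    # reward is computed inside the workflow
    auxiliary_models={
        # key must match the lookup in the workflow function
        "participant": TunerModelConfig(...),
    },
    group_size=4,                       # multi-agent rollouts are expensive
)
```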
Switching Training Target via workflow_args
In the werewolf game, you may want to train werewolves in one run and good guys in another, without changing the code. The tuner supports passing extra arguments to the workflow function through the task’s workflow_args field.
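Inside the workflow function, the training target can then be selected from these arguments. The sketch below assumes a hypothetical `trainable_side` key; use whatever key you define in your configuration.

```python
# Hypothetical sketch: choose which roles get the trainable model based on a
# `trainable_side` entry in the task's workflow_args.
GOOD_ROLES = {"villager", "seer", "witch"}

def trainable_roles(workflow_args):
    side = workflow_args.get("trainable_side", "werewolf")
    return {"werewolf"} if side == "werewolf" else GOOD_ROLES

def pick_model(role, workflow_args, model, auxiliary_models):
    """Return the trainable model for target roles, a frozen model otherwise."""
    if role in trainable_roles(workflow_args):
        return model
    return auxiliary_models["participant"]
```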
Set `workflow_args` in your YAML configuration file:
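For example (the task-file schema and the `trainable_side` key are illustrative assumptions; only the `workflow_args` field is what the tuner passes through):

```yaml
tasks:
  - name: werewolf_game
    workflow_args:
      trainable_side: werewolf   # switch to "good" to train villagers/seer/witch
```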
Complete Example
Werewolf Game Training Example
A full end-to-end example training werewolf agents in a 7-player social deduction game — achieving ~85% win rate (up from ~50%) with configurable training targets.
Best Practices
Start simple and scale up
Begin with a small number of agents and short interaction episodes. Scale up once you confirm the setup works correctly.
Validate locally first
Run your workflow function locally with a few test tasks before launching the full tuning process to catch bugs early.
Use a stronger auxiliary model
Using a stronger model for auxiliary agents provides a more challenging and stable environment for the trainable agents, which generally leads to better training outcomes.
Monitor with logging
Add the logger parameter to your workflow function (see Agent Reinforcement Learning — Runtime Monitoring) to debug multi-agent interactions during tuning.

Design clear reward signals
In multi-agent settings, sparse rewards (e.g., only win/loss at the end) can slow training. Consider adding intermediate reward signals when possible.
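For instance, a shaped reward might combine the final outcome with small per-event bonuses; the event names and weights below are illustrative assumptions, not part of the tuner:

```python
def shaped_reward(won, events):
    """Final win/loss signal plus small intermediate bonuses."""
    bonuses = {"survived_round": 0.05, "vote_succeeded": 0.10}
    reward = 1.0 if won else -1.0
    reward += sum(bonuses.get(event, 0.0) for event in events)
    return reward
```

Keep intermediate bonuses small relative to the terminal signal so they guide, rather than dominate, the final objective.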
Handle long episodes
Multi-agent interactions can produce very long conversation histories. Set max_model_len appropriately and consider adding timeouts in your workflow to avoid excessively long episodes.
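One way to bound episode length, assuming your game loop is an async coroutine (`run_game` here is a stand-in for your own code), is a wall-clock timeout:

```python
import asyncio

# Hypothetical sketch: cap episode duration so a stuck multi-agent loop
# cannot stall training.
async def run_with_timeout(run_game, players, timeout_s=600.0):
    try:
        return await asyncio.wait_for(run_game(players), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # treat as an aborted episode -> give a neutral reward
```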