Skip to main content
Model selection tuning enables you to automatically identify the best performing model from a set of candidates based on evaluation metrics. This method is ideal when you have multiple models available and want to systematically compare them against your specific tasks, cost constraints, and latency requirements — without manual trial-and-error. You will learn the following core features step by step:
CapabilityCorresponding Component
Implement task execution logicWorkflow Function
Evaluate model outputs objectivelyJudge Function, JudgeOutput
Run automated selection over a datasetDatasetConfig, select_model

Core Components

The model selection process involves three core components that work together:
  • Workflow Function: An async function that executes your agent logic with a given model and returns the result.
  • Judge Function: Evaluates the workflow output and returns a reward indicating performance (higher is better).
  • Task Dataset: A collection of tasks for evaluating and comparing models.
WorkflowOutput and JudgeOutput are framework-provided data classes. Your workflow and judge functions must return instances of these types. Do not define your own output classes.

Prerequisites

Before running the examples, install the required dependencies and set up your API key:
# Install core dependencies
pip install agentscope

# For the BLEU score example, install additional packages
pip install sacrebleu

# Set your DashScope API key
export DASHSCOPE_API_KEY="your_api_key_here"
Never commit your API key to version control. Use environment variables or a .env file (with python-dotenv) for local development.

Setup & Configuration

Define your candidate models that will be evaluated:
import os
from agentscope.model import DashScopeChatModel

candidate_models = [
    DashScopeChatModel(
        "qwen-turbo",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=512,
    ),
    DashScopeChatModel(
        "qwen-plus",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=512,
    ),
    DashScopeChatModel(
        "qwen-max",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=512,
    ),
]

Defining the Workflow Function

The workflow function executes your agent logic with a given model and returns a standardized result. It must:
  • Accept a task (e.g., a question or input) and a model instance
  • Run inference using that model
  • Return a WorkflowOutput object containing the model’s response
This pattern is identical to the one introduced in Overview. For a complete implementation example — including how to set up an agent, format messages, and return structured output — please refer to that guide.
Your function must return WorkflowOutput. Do not define custom output classes.

Implementing the Judge Function

The judge function evaluates the output of the workflow and assigns a numerical reward (higher = better) along with optional diagnostic metrics. It must:
  • Accept the original task and the response from the workflow
  • Compute a scalar reward (e.g., accuracy, BLEU score, or inverse latency)
  • Return a JudgeOutput object with reward and metrics
This follows the same contract described in Overview. That guide provides a step-by-step example of building a correctness-based judge.

Using Built-in Judges

AgentScope provides built-in judge functions for common efficiency metrics:
from agentscope.tuner.model_selection import (
    avg_time_judge,
    avg_token_consumption_judge,
)

# For selecting based on fastest execution time
judge_function = avg_time_judge

# For selecting based on lowest token consumption
judge_function = avg_token_consumption_judge

Running Model Selection

With your components defined, run the model selection process:
from agentscope.tuner import DatasetConfig
from agentscope.tuner.model_selection import select_model

dataset_config = DatasetConfig(
    path="your_dataset_path",
    split="test",
    total_steps=100,  # Optional: limit evaluation steps
)

best_model, metrics = await select_model(
    workflow_func=workflow,
    judge_func=judge_function,
    train_dataset=dataset_config,
    candidate_models=candidate_models,
)

print(f"Selected best model: {best_model.model_name}")
print(f"Metrics: {metrics}")

# Example metrics structure:
# {
#   "qwen-turbo": {"avg_time": 1.24, "accuracy": 0.78},
#   "qwen-plus": {"avg_time": 2.05, "accuracy": 0.85},
#   "qwen-max":  {"avg_time": 3.80, "accuracy": 0.86}
# }
Key configurations include:
  • workflow_func: The workflow function that executes tasks with different models.
  • judge_func: The judge function that evaluates performance.
  • train_dataset: Configuration for the evaluation dataset.
  • candidate_models: List of models to compare.

Supported Dataset Formats

DatasetConfig supports multiple data sources:
TypeExample pathFormat
Hugging Face Dataset"openai/gsm8k"Must specify name and split
Local JSON File"./data/tasks.json"Array of objects with question/answer
Local JSONL File"./data/tasks.jsonl"One JSON object per line

Minimal JSON Example (tasks.json)

[
  {
    "question": "What is 2 + 2?",
    "answer": "4"
  },
  {
    "question": "Translate 'hello' to French.",
    "answer": "bonjour"
  }
]

Complete Examples

Example 1: Token Usage Optimization

This example selects the best model based on token consumption:
import os
import asyncio
from typing import Dict, Any
from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg
from agentscope.model import DashScopeChatModel
from agentscope.tuner import DatasetConfig, WorkflowOutput
from agentscope.tuner.model_selection import (
    select_model,
    avg_token_consumption_judge,
)

candidate_models = [
    DashScopeChatModel(
        "qwen-turbo",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=512,
    ),
    DashScopeChatModel(
        "qwen-plus",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=512,
    ),
]

async def workflow(
    task: Dict[str, Any],
    model: Any,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="solver",
        sys_prompt="You are a helpful problem solving agent.",
        model=model,
        formatter=OpenAIChatFormatter(),
    )

    question = task.get("question", "")
    msg = Msg(name="user", content=question, role="user")
    response = await agent.reply(msg=msg)

    return WorkflowOutput(response=response)

async def main():
    dataset_config = DatasetConfig(
        path="openai/gsm8k",
        name="main",
        split="test",
        total_steps=20,
    )

    best_model, metrics = await select_model(
        workflow_func=workflow,
        judge_func=avg_token_consumption_judge,
        train_dataset=dataset_config,
        candidate_models=candidate_models,
    )

    print(f"Best model: {best_model.model_name}")
    print(f"Metrics: {metrics}")

if __name__ == "__main__":
    asyncio.run(main())
    # If running in Jupyter Notebook, use:
    # await main()

Example 2: Translation Quality with BLEU Score

This example selects the best model for translation tasks based on BLEU score:
import os
import asyncio
from typing import Dict, Any
from agentscope.agent import ReActAgent
from agentscope.formatter import OpenAIChatFormatter
from agentscope.message import Msg
from agentscope.model import DashScopeChatModel
from agentscope.tuner import DatasetConfig, WorkflowOutput, JudgeOutput
from agentscope.tuner.model_selection import select_model

models = [
    DashScopeChatModel(
        "qwen3-max-2025-09-23",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=1024,
    ),
    DashScopeChatModel(
        "Moonshot-Kimi-K2-Instruct",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=1024,
    ),
    DashScopeChatModel(
        "MiniMax-M2.1",
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        max_tokens=1024,
    ),
]

async def translation_workflow(
    task: Dict[str, Any],
    model: Any,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="translator",
        sys_prompt="You are a helpful translation agent. Translate the given text accurately, and only output the translated text.",
        model=model,
        formatter=OpenAIChatFormatter(),
    )

    source_text = task.get("question", "") if isinstance(task, dict) else str(task)
    prompt = f"Translate the following text between English and Chinese: {source_text}"
    msg = Msg(name="user", content=prompt, role="user")
    response = await agent.reply(msg=msg)

    return WorkflowOutput(response=response)

async def bleu_judge(
    task: Dict[str, Any],
    response: Any,
) -> JudgeOutput:
    import sacrebleu

    response_str = ""
    if hasattr(response, 'response'):
        response_content = response.response
        if hasattr(response_content, 'content'):
            if isinstance(response_content.content, list):
                for content_item in response_content.content:
                    if isinstance(content_item, dict) and "text" in content_item:
                        response_str += content_item["text"]
                    elif hasattr(content_item, 'text'):
                        response_str += content_item.text
            else:
                response_str = str(response_content.content)
    else:
        raise ValueError("Response is not a WorkflowOutput, please check again")

    reference_translation = task.get("answer", "") if isinstance(task, dict) else ""
    ref = reference_translation.strip()
    pred = response_str.strip()

    bleu_score = sacrebleu.sentence_bleu(pred, [ref])

    return JudgeOutput(
        reward=bleu_score.score,
        metrics={
            "bleu": bleu_score.score / 100,
            "brevity_penalty": bleu_score.bp,
            "ratio": bleu_score.ratio,
        },
    )

async def main():
    dataset_config = DatasetConfig(
        path=os.path.join(os.path.dirname(__file__), "translate_data"),
        split="test",
    )

    best_model, metrics = await select_model(
        workflow_func=translation_workflow,
        judge_func=bleu_judge,
        train_dataset=dataset_config,
        candidate_models=models,
    )

    print(f"Selected best model: {best_model.model_name}")
    print(f"Metrics: {metrics}")

if __name__ == "__main__":
    asyncio.run(main())
    # If running in Jupyter Notebook, use:
    # await main()

Key Benefits

Performance optimization

Identify the model that achieves the highest accuracy on your specific task.

Cost efficiency

Select models that achieve desired performance with lower computational costs.

Latency control

Choose models that meet your speed constraints without sacrificing quality.

Resource awareness

Find the best model that fits within your infrastructure limitations.

Best Practices

Start small to save cost and time: Model selection evaluates every candidate model on every task. Use total_steps=10 in DatasetConfig for initial testing — a full run with 3 models and 100 tasks may cost 300× a single inference.
  • Choose appropriate metrics: Align your judge function with your actual goals (accuracy, efficiency, cost, etc.)
  • Monitor detailed metrics: Use detailed metrics to understand the trade-offs between different models
  • Validate results: Manually check a few outputs from your selected model to ensure quality meets expectations