Models

This document covers detailed usage examples and provider-specific references for each model class in AgentScope.

ChatModel

Text generation, streaming, reasoning, and tools API.

TTS Models

Non-realtime and realtime text-to-speech synthesis.

Realtime Models

Bidirectional WebSocket streaming for voice agents.

Embedding Models

Vector representations for retrieval and similarity search.

For core concepts and design principles, see Model. For details on Msg and content blocks, see Msg.

ChatModel

Basic Usage

All chat model classes share a unified __call__ interface. The input to __call__ is the formatted messages — the result of applying a formatter to Msg objects. This formatted input matches the exact format expected by the underlying API provider. Method signature:

async def __call__(
    self,
    messages: list[dict],
    tools: list[dict] | None = None,
    tool_choice: Literal["auto", "none", "required"] | str | None = None,
    structured_model: Type[BaseModel] | None = None,
    **kwargs: Any,
) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
    """
    Call the chat model with formatted messages.

    Args:
        messages: Formatted messages (provider-specific format)
        tools: Optional tool schemas
        tool_choice: Tool invocation mode
        structured_model: Optional Pydantic model for structured output
        **kwargs: Additional provider-specific parameters

    Returns:
        - ChatResponse: when stream=False
        - AsyncGenerator[ChatResponse, None]: when stream=True
    """

Typical workflow when calling a model directly: In AgentScope, agents communicate by passing Msg objects. When calling a model directly (outside an agent), the typical flow is:

Build Msg objects with name, role, and content (text or content blocks)
Use a Formatter to convert [Msg] into the provider-specific message format
Call the ChatModel with the formatted messages to get a ChatResponse

When using an agent (e.g., ReActAgent), steps 2-3 are handled automatically — the agent internally manages the Msg → Formatter → Model → ChatResponse pipeline. Example workflow:

import asyncio
import os
from agentscope.formatter import DashScopeChatFormatter
from agentscope.model import DashScopeChatModel
from agentscope.message import Msg

async def example_model_call():
    # Step 1: Create model and formatter
    model = DashScopeChatModel(
        model_name="qwen-max",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        stream=False,
    )
    formatter = DashScopeChatFormatter()

    # Step 2: Build Msg objects
    user_msg = Msg(name="user", content="Hi!", role="user")

    # Step 3: Format messages (convert to provider-specific format)
    formatted_messages = await formatter.format([user_msg])

    # Step 4: Call model with formatted messages and get ChatResponse
    res = await model(formatted_messages)

    print("Response:", res.content)
    print("Usage:", res.usage)

asyncio.run(example_model_call())

The key point: ChatModel accepts formatted messages (the output of a formatter), not raw Msg objects. This design allows each model to receive input in its native API format. The model returns a ChatResponse object containing the generated content and usage information.

Streaming

To enable streaming, set stream=True in the constructor. When streaming is enabled, __call__ returns an async generator that yields ChatResponse instances.

Streaming in AgentScope is accumulative — each chunk contains all previous content plus newly generated content, not just the delta. This simplifies consumption since you always have the complete current state without tracking deltas.

async def example_streaming():
    model = DashScopeChatModel(
        model_name="qwen-max",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        stream=True,
    )
    formatter = DashScopeChatFormatter()

    user_msg = Msg(name="user", content="Count from 1 to 5.", role="user")
    formatted_messages = await formatter.format([user_msg])

    # Get async generator
    generator = await model(formatted_messages)

    # Iterate through chunks (each contains accumulated content)
    async for chunk in generator:
        print(chunk.content)  # Accumulated content up to this point

asyncio.run(example_streaming())

Example output (each line shows accumulative text):

[{'type': 'text', 'text': '1'}]
[{'type': 'text', 'text': '1\n2'}]
[{'type': 'text', 'text': '1\n2\n3'}]
[{'type': 'text', 'text': '1\n2\n3\n4'}]
[{'type': 'text', 'text': '1\n2\n3\n4\n5'}]

Reasoning

AgentScope supports reasoning models (chain-of-thought) via ThinkingBlock. When enable_thinking=True, the model’s response includes both thinking process and final answer.

async def example_reasoning():
    model = DashScopeChatModel(
        model_name="qwen-turbo",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        enable_thinking=True,  # Enable reasoning
        stream=True,
    )
    formatter = DashScopeChatFormatter()

    user_msg = Msg(name="user", content="What is 17 * 23?", role="user")
    formatted_messages = await formatter.format([user_msg])

    res = await model(formatted_messages)

    # Collect final chunk
    last_chunk = None
    async for chunk in res:
        last_chunk = chunk

    # Response contains both ThinkingBlock and TextBlock
    for block in last_chunk.content:
        block_type = block['type']
        content = block.get('thinking') or block.get('text')
        print(f"[{block_type}] {content[:80]}...")

asyncio.run(example_reasoning())

The thinking content is streamed alongside text content in accumulative mode.

Tools API

AgentScope provides a unified tools interface across all providers. Tools are defined using a standardized JSON schema format and passed to the model via the tools parameter.

async def example_tools():
    model = DashScopeChatModel(
        model_name="qwen-max",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        stream=False,
    )
    formatter = DashScopeChatFormatter()

    # Define tool schema
    json_schemas = [
        {
            "type": "function",
            "function": {
                "name": "google_search",
                "description": "Search for a query on Google.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query.",
                        },
                    },
                    "required": ["query"],
                },
            },
        },
    ]

    user_msg = Msg(name="user", content="Search AgentScope release notes.", role="user")
    formatted_messages = await formatter.format([user_msg])

    # Call model with tools
    response = await model(
        messages=formatted_messages,
        tools=json_schemas,
        tool_choice="auto",  # "auto", "none", "required", or "<function_name>"
    )

    print(response.content)

asyncio.run(example_tools())

The tool_choice parameter controls invocation behavior:

"auto": Model decides whether to call a tool
"none": No tools will be called
"required": Model must call at least one tool
"<function_name>": Force a specific tool

Use the Toolkit class to auto-generate JSON schemas from Python functions with docstrings. See Tool for details.

Provider Reference

AgentScope supports multiple chat model providers. Each provider has a corresponding model class and formatter:

Provider	Model Class	Formatter	Key Features
OpenAI	`OpenAIChatModel`	`OpenAIChatFormatter` / `OpenAIChatMultiAgentFormatter`	Supports OpenAI, vLLM, DeepSeek, and OpenAI-compatible APIs
DashScope	`DashScopeChatModel`	`DashScopeChatFormatter` / `DashScopeMultiAgentFormatter`	Supports Qwen models, VL models, reasoning models
Gemini	`GeminiChatModel`	`GeminiChatFormatter` / `GeminiMultiAgentFormatter`	Google Gemini models with multimodal support
Anthropic	`AnthropicChatModel`	`AnthropicChatFormatter` / `AnthropicMultiAgentFormatter`	Claude models with extended thinking
Ollama	`OllamaChatModel`	`OllamaChatFormatter` / `OllamaMultiAgentFormatter`	Local LLM hosting

For detailed provider-specific parameters and examples, refer to the original documentation or source code.

Token Counting

AgentScope provides a token counter module under agentscope.token to estimate the number of tokens in a set of messages before sending them to a model. This is useful for managing context window budgets and implementing prompt truncation strategies.

The formatter module integrates token counters to support automatic prompt truncation. When a token budget is configured, the formatter uses the corresponding counter to trim messages before they are sent to the model.

Supported providers:

Provider	Class	Image Data	Tools
Anthropic	`AnthropicTokenCounter`	✅	✅
OpenAI	`OpenAITokenCounter`	✅	✅
Gemini	`GeminiTokenCounter`	✅	✅
HuggingFace	`HuggingFaceTokenCounter`	Depends on the model	Depends on the model

DashScope does not provide a token-counting API. For DashScope (Qwen) models, use HuggingFaceTokenCounter with the corresponding Qwen tokenizer instead.

import asyncio
from agentscope.token import OpenAITokenCounter

async def example_token_counting():
    messages = [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi, how can I help you?"},
    ]

    counter = OpenAITokenCounter(model_name="gpt-4.1")
    n_tokens = await counter.count(messages)

    print(f"Number of tokens: {n_tokens}")

asyncio.run(example_token_counting())

TTS Models

TTS (Text-to-Speech) models convert text into audio. AgentScope supports both non-realtime and realtime TTS models.

Non-Realtime TTS

Non-realtime TTS models require complete text before synthesis. The core method is synthesize(), which accepts a Msg object and returns a TTSResponse containing audio data.

async def synthesize(self, msg: Msg) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
    """
    Synthesize speech from text.

    Args:
        msg: A Msg object containing text content

    Returns:
        - TTSResponse: when stream=False (complete audio)
        - AsyncGenerator[TTSResponse, None]: when stream=True (audio chunks)
    """

Basic usage:

import asyncio
import os
from agentscope.tts import DashScopeTTSModel
from agentscope.message import Msg

async def example_non_realtime_tts():
    tts_model = DashScopeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash",
        voice="Cherry",
        stream=False,
    )

    msg = Msg(name="assistant", content="Hello, this is a TTS demo.", role="assistant")
    tts_response = await tts_model.synthesize(msg)

    # tts_response.content contains an AudioBlock with base64-encoded audio
    print("Audio data length:", len(tts_response.content["source"]["data"]))

asyncio.run(example_non_realtime_tts())

Streaming output (stream=True) returns audio chunks progressively:

async def example_streaming_tts():
    tts_model = DashScopeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash",
        voice="Cherry",
        stream=True,
    )

    msg = Msg(name="assistant", content="Hello, streaming TTS.", role="assistant")

    async for tts_response in await tts_model.synthesize(msg):
        print("Received audio chunk:", len(tts_response.content["source"]["data"]))

asyncio.run(example_streaming_tts())

Realtime TTS

Realtime TTS models accept streaming text input — text chunks can be fed incrementally as they become available (e.g., from a streaming chat model). This enables the lowest possible latency. Core methods:

async def push(self, msg: Msg) -> TTSResponse:
    """
    Non-blocking. Submit text chunk and return any audio received so far.
    """

async def synthesize(self, msg: Msg) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
    """
    Blocking. Finalize the session and return all remaining audio.
    """

Key concepts:

Stateful processing: Only one streaming session can be active at a time, identified by msg.id
Incremental input: Use push() to submit text chunks as they arrive
Finalization: Use synthesize() to complete the session and get remaining audio

Usage example:

from agentscope.tts import DashScopeRealtimeTTSModel

async def example_realtime_tts():
    tts_model = DashScopeRealtimeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash-realtime",
        voice="Cherry",
        stream=False,
    )

    async with tts_model:
        # Push accumulative text chunks (non-blocking)
        res = await tts_model.push(msg_chunk_1)
        res = await tts_model.push(msg_chunk_2)
        # ...
        # Finalize and get all remaining audio (blocking)
        res = await tts_model.synthesize(final_msg)

Integration with Agent: AgentScope agents can automatically synthesize speech when provided with a TTS model. The agent handles the streaming text → TTS pipeline internally.

from agentscope.agent import ReActAgent
from agentscope.formatter import DashScopeChatFormatter
from agentscope.model import DashScopeChatModel
from agentscope.tts import DashScopeRealtimeTTSModel

agent = ReActAgent(
    name="Assistant",
    sys_prompt="You are a helpful assistant.",
    model=DashScopeChatModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen-max",
        stream=True,
    ),
    formatter=DashScopeChatFormatter(),
    tts_model=DashScopeRealtimeTTSModel(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        model_name="qwen3-tts-flash-realtime",
        voice="Cherry",
    ),
)

When the agent generates streaming text responses, the TTS model automatically converts them to speech in real-time.

Realtime Models

Realtime models provide bidirectional, persistent communication over WebSocket, designed primarily for voice agent scenarios where the user speaks and the model responds with speech in real-time.

Principle

Realtime models maintain a persistent WebSocket connection that supports:

Bidirectional streaming: Audio/text input and audio/text output flow simultaneously
Low latency: Server-side VAD (Voice Activity Detection) enables natural turn-taking
Multimodal input: Audio, text, images (provider-dependent)
Tool support: Some providers support function calling in realtime (e.g., OpenAI, Gemini)

The key difference from traditional chat models is that realtime models handle the entire voice interaction pipeline (ASR + LLM + TTS) in a single, optimized connection, minimizing latency.

Usage with RealtimeAgent

AgentScope provides RealtimeAgent to work with realtime models. The agent handles the WebSocket connection, audio streaming, and message exchange automatically.

from agentscope.agent import RealtimeAgent
from agentscope.realtime import OpenAIRealtimeModel

# Create realtime model
realtime_model = OpenAIRealtimeModel(
    model_name="gpt-4o-realtime-preview",
    api_key=os.environ["OPENAI_API_KEY"],
    voice="alloy",
)

# Create realtime agent
agent = RealtimeAgent(
    name="VoiceAssistant",
    sys_prompt="You are a helpful voice assistant.",
    model=realtime_model,
)

# Start conversation (handles audio I/O automatically)
await agent.start()

The RealtimeAgent manages:

WebSocket connection lifecycle
Audio input/output streaming
Turn-taking and interruption handling
Tool execution (if supported by the model)

Supported providers:

Provider	Model Class	Audio I/O	Tool Support	VAD
OpenAI	`OpenAIRealtimeModel`	24kHz / 24kHz	Yes	Yes
DashScope	`DashScopeRealtimeModel`	16kHz / 24kHz	No	Yes
Gemini	`GeminiRealtimeModel`	16kHz / 24kHz	Yes	Yes

Embedding Models

Embedding models generate vector representations for text, images, and other data types. These embeddings are used for retrieval, similarity search, and as input features for downstream tasks.

Core Method

All embedding models share a unified __call__ interface that accepts input data and returns an EmbeddingResponse:

async def __call__(
    self,
    inputs: List[str | TextBlock] | List[TextBlock | ImageBlock | VideoBlock],
    **kwargs,
) -> EmbeddingResponse:
    """
    Generate embeddings for input data.

    Args:
        inputs: Text strings, TextBlocks, or multimodal content blocks

    Returns:
        EmbeddingResponse containing:
        - embeddings: List of embedding vectors
        - usage: Token count and time information
        - source: "api" or "cache"
    """

Text Embedding

Text embedding models accept text strings or TextBlock objects:

import asyncio
import os
from agentscope.embedding import DashScopeTextEmbedding

async def example_text_embedding():
    embedding_model = DashScopeTextEmbedding(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        model_name="text-embedding-v3",
    )

    # Embed text strings
    texts = ["Hello world", "AgentScope is awesome"]
    response = await embedding_model(texts)

    print(f"Generated {len(response.embeddings)} embeddings")
    print(f"Embedding dimension: {len(response.embeddings[0])}")
    print(f"Tokens used: {response.usage.tokens}")

asyncio.run(example_text_embedding())

Multimodal Embedding

Multimodal embedding models accept text, images, and videos using content blocks:

import asyncio
import os
from agentscope.embedding import DashScopeMultiModalEmbedding
from agentscope.message import TextBlock

async def example_multimodal_embedding():
    embedding_model = DashScopeMultiModalEmbedding(
        api_key=os.environ["DASHSCOPE_API_KEY"],
        model_name="multimodal-embedding-v1",
        dimensions=1024,
    )

    # Embed text content (multimodal model also supports text)
    inputs = [
        TextBlock(type="text", text="A beautiful sunset"),
        TextBlock(type="text", text="AgentScope framework"),
    ]

    response = await embedding_model(inputs)

    print(f"Generated {len(response.embeddings)} embeddings")
    print(f"Embedding dimension: {len(response.embeddings[0])}")

asyncio.run(example_multimodal_embedding())

For image and video inputs, use ImageBlock with URLSource (for publicly accessible URLs) or Base64Source (for base64-encoded data). The example above uses text for simplicity.

Provider Reference

AgentScope supports multiple embedding model providers:

Provider	Model Class	Supported Modalities	Key Features
OpenAI	`OpenAITextEmbedding`	Text	High-quality text embeddings with configurable dimensions
DashScope	`DashScopeTextEmbedding`	Text	Qwen-based text embeddings
DashScope	`DashScopeMultiModalEmbedding`	Text, Image, Video	Unified embeddings for cross-modal retrieval
Gemini	`GeminiTextEmbedding`	Text	Google Gemini text embeddings
Ollama	`OllamaTextEmbedding`	Text	Local embedding models

Common parameters:

api_key: API key for authentication
model_name: The embedding model identifier
dimensions: Embedding vector dimension (provider-dependent)
embedding_cache: Optional cache instance to avoid repeated API calls

Usage tips:

Use text embedding models for semantic search, clustering, and classification tasks.
Use multimodal embedding models for cross-modal retrieval (e.g., search images by text).
Enable caching for frequently embedded content to reduce API costs.
Batch multiple inputs in a single call for better efficiency.

Get Started

Tutorial

Out-of-box Agents

Basic Concepts

Building Blocks

Observe & Evaluate

Tune Agent

Deploy & Serve

Others

ChatModel

TTS Models

Realtime Models

Embedding Models

ChatModel

Basic Usage

Streaming

Reasoning

Tools API

Provider Reference

Token Counting

TTS Models

Non-Realtime TTS

Realtime TTS

Realtime Models

Principle

Usage with RealtimeAgent

Embedding Models

Core Method

Text Embedding

Multimodal Embedding

Provider Reference

Get Started

Tutorial

Out-of-box Agents

Basic Concepts

Building Blocks

Observe & Evaluate

Tune Agent

Deploy & Serve

Others

ChatModel

TTS Models

Realtime Models

Embedding Models

​ChatModel

​Basic Usage

​Streaming

​Reasoning

​Tools API

​Provider Reference

​Token Counting

​TTS Models

​Non-Realtime TTS

​Realtime TTS

​Realtime Models

​Principle

​Usage with RealtimeAgent

​Embedding Models

​Core Method

​Text Embedding

​Multimodal Embedding

​Provider Reference

ChatModel

Basic Usage

Streaming

Reasoning

Tools API

Provider Reference

Token Counting

TTS Models

Non-Realtime TTS

Realtime TTS

Realtime Models

Principle

Usage with RealtimeAgent

Embedding Models

Core Method

Text Embedding

Multimodal Embedding

Provider Reference