Skip to main content
Data Science Agent is an autonomous, end-to-end assistant that transforms high-level analytical questions into executable data science workflows. It seamlessly handles the full pipeline from data acquisition and cleaning to modeling, visualization, and narrative reporting with minimal human intervention, enabling users to move efficiently from intent to insight in real-world scenarios. The agent is equipped with proven analytical methodologies including Pareto Drill-Down Analysis for identifying root causes, Hypothesis-Driven Analysis for rigorous validation, and Root Cause Analysis for systematic problem investigation, ensuring high-quality insights across diverse data science tasks.
Data Science Agent overall workflow

Quick Start

1

Prerequisites

  • Python 3.10 or higher
  • Docker (for sandbox execution environment)
  • DashScope API key from Alibaba Cloud
2

Installation

# From the project root directory
cd agentscope-samples/alias
pip install -e .
This installs the alias_agent command-line tool.
3

Configuration

Set up the following necessary API key and environment variables:
export DASHSCOPE_API_KEY="your_dashscope_api_key_here"
4

Sandbox Setup

The Data Science Agent requires a sandbox environment for secure code execution. Start the sandbox server:
# If using colima, set Docker host
export DOCKER_HOST=unix://$HOME/.colima/default/docker.sock

# Pull the sandbox image
export RUNTIME_SANDBOX_REGISTRY=agentscope-registry.ap-southeast-1.cr.aliyuncs.com
docker pull agentscope-registry.ap-southeast-1.cr.aliyuncs.com/agentscope/runtime-sandbox-alias:latest
5

Usage

You can start running the Data Science agent in your terminal with the following command:
# Example with built-in sample data (ready to run)
alias_agent run --mode ds \
  --task "Analyze the distribution of incidents across categories in 'incident_records.csv' to identify imbalances, inconsistencies, or anomalies, and determine their root cause." \
  --datasource ./docs/data/incident_records.csv
  
# Multi-file analysis (replace with your file paths)
alias_agent run --mode ds \
  --task "Compare sales trends across regions" \
  --datasource ./path/to/your/data1.csv ./path/to/your/data2.csv

# Database analysis (replace with your database credentials)
alias_agent run --mode ds \
  --task "Identify customer churn patterns" \
  --datasource postgresql://user:password@localhost:5432/mydb
The Data Science Agent supports multiple data source formats:
  • Local Files: CSV, Excel, JSON files — example: --datasource ./data.csv ./data.xlsx
  • Databases: PostgreSQL, SQLite, and other SQL databases — example: --datasource postgresql://user:pass@host:5432/db
  • Multiple Sources: Combine different data sources — example: --datasource ./file1.csv ./file2.xlsx postgresql://...
All uploaded files are automatically copied to the /workspace directory in the sandbox for secure processing.
The Data Science Agent is built with DashScope chat models. If you want to change the model, ensure you also update the formatter accordingly. The correspondence between built-in models and formatters is listed in the Provider Reference.

Key Features

Intelligent Scenario Routing

At startup, the Data Science Agent uses an intelligent router to automatically assign user tasks to one of three core scenarios:
  • Exploratory Data Analysis (EDA): For understanding data distributions, patterns, and relationships
  • Predictive Modeling: For building machine learning models and forecasts
  • Exact Data Computation: For precise calculations and aggregations
Each scenario is driven by a dedicated prompt template tailored to its analytical intent, ensuring optimal performance for different types of data science tasks.
available_prompts = {
    "explorative_data_analysis": "EDA workflow with visualization emphasis",
    "data_modeling": "ML pipeline with model evaluation",
    "data_computation": "Precise numerical computation workflow",
}
Custom Scenarios: You can add new scenarios by creating prompt templates (e.g., _scenario_explorative_data_analysis.md) and registering them in available_prompts.

Scalable File Filtering Pipeline

The agent features a sophisticated file filtering system that quickly locates relevant files in massive data lakes. This is particularly useful when working with:
  • Large directories with hundreds of files
  • Mixed file types and formats
  • Complex data hierarchies
The filtering pipeline analyzes file names, metadata, and content to intelligently select the most relevant data sources for your analysis task.
async def files_filter_pre_reply_hook(
    self,
    msg: Msg | list[Msg] | None = None,
) -> Msg | list[Msg] | None:
    """Filter and select relevant files from the data source manager"""

Robust Spreadsheet Parsing

The Data Science Agent can handle irregular spreadsheets that are common in real-world scenarios through the clean_messy_spreadsheet function:
  • Merged Cells: Correctly interprets cells spanning multiple rows/columns
  • Multi-level Headers: Handles complex header structures
  • Embedded Notes: Extracts and processes metadata and annotations stored in __metadata field
  • Mixed Data Types: Intelligently handles numeric, text, and date data
The parser automatically converts irregular Excel files into well-structured JSON with two processing strategies:
async def clean_messy_spreadsheet(toolkit, file: str) -> ToolResponse:
    """
    Clean the given messy spreadsheet and convert it into a readable JSON.
    Uses LLM to extract structured tables with semantic understanding.
    
    Output structure:
    {
        "sheet_name": {
            "table_name": [
                ["column1", "column2", ...],  # Headers
                [value1, value2, ...],         # Row 1
                [value1, value2, ...]          # Row 2
            ],
            "__metadata": ["descriptive text"]
        }
    }
    """

Multimodal Understanding

Beyond tabular data, the agent supports multimodal analysis capabilities through vision-language models:

Image Summarization

Extracts comprehensive information from images including text, objects, layout, and chart insights:
def summarize_image(
    dash_scope_multimodal_tool_set,
    image_path: str,
) -> ToolResponse:
    """
    Use a vision-language model to extract all information from the image,
    including text, objects, layout relationships, chart conclusions, etc.
    
    Args:
        image_path: Path to the image file, e.g., '/workspace/image.jpg'
    """

Question Answering About Images

Answers natural language questions about image content:
def answer_question_about_image(
    dash_scope_multimodal_tool_set,
    image_path: str,
    question: str,
) -> ToolResponse:
    """
    Answer questions about image content using a vision-language model.
    
    Args:
        image_path: Path to the image file, e.g., '/workspace/image.jpg'
        question: A natural language question, e.g., "What is the trend in Q3?"
    """

Automatic Report Generation

For EDA tasks, the Data Science Agent automatically generates comprehensive interactive HTML reports that combine:
  • Insights: Key findings and patterns discovered in the data
  • Visualizations: Interactive charts and graphs
  • Executable Code: Python code used for analysis (ensuring reproducibility)
  • Narrative: Clear explanations of analytical steps and conclusions
async def generate_response(self, **kwargs: Any) -> ToolResponse:
    """Generate required structured output and comprehensive report"""
    report_generator = ReportGenerator(
        model=self.model,
        formatter=self.formatter,
        memory_log=memory_log,
    )
    
    response, report_md, report_html = await report_generator.generate_report()