Hualin Luan Cloud Native · Quant Trading · AI Engineering


Original interpretation: Kaggle white paper "Introduction to Agents" - AI Agent introduction and architecture panorama

An in-depth analysis of the Agent's five levels, core architecture, and production practices, organizing the key frameworks and takeaways from the Kaggle white paper "Introduction to Agents"

Published: 3/12/2026
Category: interpretation
Reading time: 25 min read

📋 Copyright Statement and Disclaimer

This article is the author’s original analysis based on the Kaggle white paper “Introduction to Agents”.

Opinion Attribution Statement:

  • The analysis of the five levels of Agent, architecture evolution and engineering practice in this article is the author’s independent organization and reinterpretation based on the white paper
  • The writing structure, case organization and key judgments reflect the author’s personal understanding
  • The article is not a paragraph-by-paragraph translation of the white paper, but an original interpretation oriented to engineering practice.

Original reference:

  • Title: “Introduction to Agents”
  • Authors: Alan Blount, Antonio Gulli, Shubham Saboo, Michael Zimmermann, Vladimir Vuskovic
  • Link: Read original text

Original nature: This article is an independently created interpretation article, not a translation or rewriting. The views expressed in this article represent only the author’s personal understanding and may differ from the original author’s position.


Introduction: From a model that can answer questions to an Agent that can complete tasks

In the past few years, the most striking capabilities of large models have centered on “answering questions” and “generating content”: they can write code, summarize, draft copy, and in many scenarios give near-expert explanations. But once we shift perspective from “generating an answer” to “accomplishing a goal”, the boundary of a traditional LLM becomes obvious: it can offer suggestions, but it cannot by itself plan, call tools, decompose tasks, and execute all the way to a result.

The real value of the Kaggle white paper “Introduction to Agents” is that it states this difference plainly: **an Agent is not simply an LLM plugged into a workflow, but a complete application system that can continuously decide and act in pursuit of a goal.**

This article does not retell the white paper paragraph by paragraph. Instead, it reorganizes the parts most worth your attention from an engineering-practice perspective: Why is the Agent a paradigm shift? How should we understand its five levels? How does architectural complexity grow step by step, from basic reasoning and tool connection to multi-step planning and system collaboration? And why should these concepts directly shape how you build AI products?


1. From predictive AI to autonomous agents: a paradigm shift

In the past few years, AI development has centered on passive, discrete-task models: translate a passage, answer a question, generate an image from a prompt. Powerful as this paradigm is, it requires explicit human instruction at every step.

We are now witnessing a paradigm shift from AI that can only predict or create content to new types of software that can solve problems and perform tasks autonomously.

This is the core idea of the AI Agent.

1.1 Paradigm comparison: from passive to active

Let’s understand this difference through a specific scenario:

Traditional AI (predictive):

User: What will the weather be in Beijing tomorrow?
AI: Beijing will be cloudy tomorrow, with temperatures from 15 to 22 degrees.
(conversation ends; wait for the next instruction)

Agent (autonomous):

User: I want to visit Beijing this weekend. Any suggestions?
Agent:
1. check Beijing weekend weather forecast
2. search popular Beijing weekend attractions
3. recommend outdoor activities based on the weather
4. check traffic information and recommend transport options
5. combine the information into a complete itinerary

Here is a complete recommendation: because Saturday is sunny, visit the Great Wall in the morning and the Summer Palace in the afternoon...

Agents don’t just answer questions, but actively plan and execute a series of actions to achieve goals.

1.2 What exactly is Agent?

Many people mistakenly believe that an Agent is just an LLM embedded in a workflow. But in the 54-page Kaggle white paper, the authors give a more precise definition:

The Agent is not an AI model in a static workflow, but a complete application system - one that can make plans and take actions to achieve goals.

Agents combine the reasoning capabilities of language models with practical action capabilities, enabling them to handle complex, multi-step tasks that cannot be accomplished by the model itself.

The key capability: an Agent can work autonomously, figuring out for itself the steps required to reach the goal, without human guidance at every step.

1.3 Why is Agent important now?

There are three key factors driving the rise of Agent:

  1. Leap in model capabilities: Large models such as GPT-4 and Claude have powerful reasoning and planning capabilities
  2. Mature tool ecosystem: Protocols such as Function Calling and MCP standardize the connection between models and the external world.
  3. Actual demand driven: Enterprises need AI systems that can automate complex business processes

2. Five levels of Agent: from reasoning to self-evolution

The white paper proposes a valuable taxonomy that divides Agent systems into five levels. The classification not only helps us understand the Agent's evolution path but also provides a clear reference framework for architecture design.

Level 0: Core Reasoning System

This is the base layer and contains only the language model itself. It can reason and answer questions, but has no ability to interact with the outside world.

Capability Boundary:

  • Can answer knowledge questions
  • Able to reason logically
  • Can generate text and code
  • Unable to obtain real-time information
  • Unable to perform external operation

Typical Application:

  • Internal knowledge Q&A (based on training data)
  • Text generation and polishing
  • Simple logical reasoning tasks

Architecture Example:

user input → LLM → model output

Code Example:

from openai import OpenAI

client = OpenAI()

class Level0Agent:
    def __init__(self):
        self.system_prompt = "You are a helpful assistant."

    def chat(self, user_message):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message}
            ]
        )
        return response.choices[0].message.content

Level 1: Connected Problem-Solver

Based on Level 0, tool calling capability is added. Agents can search web pages, query databases, and call APIs to obtain information or perform actions.

Core Improvements:

  • Ability to access real-time information
  • Can call external API
  • Have basic data acquisition capabilities

Typical Application:

  • Q&A assistant with search function
  • Weather query robot
  • Stock information query

Architecture Example:

user input → LLM → need tool?
                    ├─ Yes → call tool → merge results → output
                    └─ No → direct output

Code Example:

import json
from openai import OpenAI

client = OpenAI()

class Level1Agent:
    def __init__(self):
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "search_web",
                    "description": "search the web for real-time information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        },
                        "required": ["query"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "get weather for the specified city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string"},
                            "date": {"type": "string"}
                        },
                        "required": ["city"]
                    }
                }
            }
        ]

    def execute_tool(self, tool_name, params):
        if tool_name == "get_weather":
            # a real implementation would call a weather API
            return {"temperature": 22, "condition": "sunny"}
        if tool_name == "search_web":
            # a real implementation would call a search API
            return {"results": []}
        return {}

    def chat(self, user_message):
        messages = [{"role": "user", "content": user_message}]

        # first call: request tool calls
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=self.tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # check whether tool calls are needed
        if message.tool_calls:
            # execute tool calls
            tool_results = []
            for tool_call in message.tool_calls:
                function_name = tool_call.function.name
                function_params = json.loads(tool_call.function.arguments)
                result = self.execute_tool(function_name, function_params)
                tool_results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps(result)
                })

            # append tool results to the conversation
            messages.append(message)
            messages.extend(tool_results)

            # second call: get the final response
            final_response = client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
            return final_response.choices[0].message.content

        return message.content

Level 2: Strategic Problem-Solver

This layer introduces planning capabilities and multi-step reasoning. Agent can decompose complex tasks into subtasks and execute them in strategic order.

Core Improvements:

  • Task decomposition
  • Multi-step planning
  • Error recovery mechanism
  • Processing and integration of intermediate results

Typical Application:

  • Research report generation (search → analysis → writing)
  • Data analysis tasks (obtaining data → cleaning → analysis → visualization)
  • Complex problem diagnosis

Architecture Example:

User goal → Task decomposer ─┬→ Subtask 1 → Execute → result
                             ├→ Subtask 2 → Execute → result
                             └→ Subtask 3 → Execute → result
                                            ↓
                              Result combiner → Final output

Code Example:

class Level2Agent:
    def __init__(self):
        self.level1_agent = Level1Agent()

    def decompose_task(self, goal):
        """decompose a complex task into subtasks"""
        decomposition_prompt = f"""
        decompose the following goal into executable steps:
        Goal: {goal}

        Requirements:
        1. each step should be executable
        2. dependencies between steps must be explicit
        3. output format is a JSON array

        Example output:
        [
            {{"step": 1, "action": "search related material", "depends_on": []}},
            {{"step": 2, "action": "analyze collected information", "depends_on": [1]}},
            {{"step": 3, "action": "generate a report", "depends_on": [2]}}
        ]
        """

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": decomposition_prompt}]
        )

        steps = json.loads(response.choices[0].message.content)
        return steps

    def execute_task(self, goal):
        """execute complex task"""
        # 1. task decomposition
        steps = self.decompose_task(goal)

        # 2. execute in dependency order
        results = {}
        for step in steps:
            step_num = step["step"]
            action = step["action"]

            # build context, including previous step results
            context = f"Task: {action}\n"
            if step.get("depends_on"):
                for dep in step["depends_on"]:
                    context += f"step{dep} result: {results.get(dep, '')}\n"

            # execute step
            result = self.level1_agent.chat(context)
            results[step_num] = result

            print(f"step {step_num} complete: {action}")

        # 3. merge results
        final_result = self.synthesize_results(goal, results)
        return final_result

    def synthesize_results(self, goal, results):
        """merge results from all steps"""
        synthesis_prompt = f"""
        Goal: {goal}

        Step execution results:
        {json.dumps(results, ensure_ascii=False, indent=2)}

        Merge the results above into the final output.
        """

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": synthesis_prompt}]
        )

        return response.choices[0].message.content
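The Level2Agent above omits the "error recovery mechanism" listed under core improvements. A minimal sketch of retry-with-feedback, where the failure message is fed back into the next attempt, could look like this (the helper and its names are illustrative, not from the white paper):

```python
import time

def execute_with_recovery(step_fn, max_retries=2, base_delay=1.0):
    """Run one step; on failure, retry and pass the error back as feedback.

    `step_fn` is any callable taking an optional feedback string and raising
    an exception on failure. Purely an illustrative sketch.
    """
    feedback = None
    for attempt in range(max_retries + 1):
        try:
            return step_fn(feedback)
        except Exception as e:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            feedback = f"previous attempt failed: {e}"  # fed into the next try
            time.sleep(base_delay * (2 ** attempt))     # exponential backoff
```

In a real agent, `step_fn` would wrap the LLM call for one subtask, so the model sees what went wrong on the previous attempt and can adjust.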

Level 3: Collaborative Multi-Agent System

Multiple Agents work together, and each Agent may be specialized in a specific field. They can work together, communicate with each other, and coordinate complex parallel tasks.

Core Improvements:

  • Specialized division of labor
  • Communication between agents
  • Parallel execution
  • Results coordination and integration

Analogy: Just like a software development team, there are product managers, back-end engineers, front-end engineers, and testers, each responsible for different aspects.

Typical Application:

  • Software development assistant (requirements analysis → design → coding → testing)
  • Investment research team (market analysis → financial analysis → risk assessment → investment advice)
  • Content creation team (planning → writing → editing → design)

Architecture Example:

User request → Coordinator Agent ─┬→ Research Agent → data
                                  ├→ Analysis Agent → insight
                                  ├→ Writing Agent  → content
                                  └→ Review Agent   → quality check
                                            ↓
                                    Integrated output

Code Example:

class SpecializedAgent:
    """specialized Agent base class"""
    def __init__(self, name, expertise):
        self.name = name
        self.expertise = expertise

    def process(self, task, context=None):
        prompt = f"""
        You are {self.name}, specialized in {self.expertise}.

        Task: {task}
        {f"Context: {context}" if context else ""}

        Complete this task based on your area of expertise.
        """

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        return {
            "agent": self.name,
            "result": response.choices[0].message.content
        }

class Level3MultiAgent:
    def __init__(self):
        self.coordinator = SpecializedAgent("Coordinator", "task decomposition and coordination")
        self.researcher = SpecializedAgent("Researcher", "information collection and research")
        self.analyst = SpecializedAgent("Analyst", "data analysis and insight")
        self.writer = SpecializedAgent("Writer", "content creation and writing")

    def execute(self, goal):
        """multi-agent collaborative execution"""
        # 1. the coordinator decomposes and assigns the task
        coordination_prompt = f"""
        Goal: {goal}

        Available agents:
        - Researcher: responsible for information collection
        - Analyst: responsible for data analysis
        - Writer: responsible for content creation

        Create a collaboration plan with each agent task, input, and output.
        """

        plan_response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": coordination_prompt}]
        )
        plan = plan_response.choices[0].message.content

        # 2. parallel execution, simplified example
        research_result = self.researcher.process(f"research topic: {goal}")

        # the analyst starts after the researcher finishes
        analysis_result = self.analyst.process(
            f"analyze the following research result: {research_result['result']}",
            context=research_result['result']
        )

        # the writer integrates the final output
        final_result = self.writer.process(
            f"write the final report based on this analysis: {analysis_result['result']}",
            context=f"research: {research_result['result']}\nanalysis: {analysis_result['result']}"
        )

        return {
            "plan": plan,
            "research": research_result,
            "analysis": analysis_result,
            "final": final_result
        }

Level 4: Self-Evolving System

At the highest level, agents can learn from experience, self-optimize strategies, improve their own tool usage, and even modify their own system configurations.

Core Improvements:

  • Learning from experience
  • Self-optimizing strategies
  • Improving tool usage
  • Adapting system configuration

This represents the ultimate vision of Agent technology—a truly autonomous system capable of continuous improvement.

Typical Application:

  • Self-optimizing customer service system
  • Adaptive Trading Strategies
  • Continuous learning research assistant

Core Mechanism:

Execute → observe result → analyze performance → identify improvements → adjust strategy
   ↑                                                                          │
   └───────────────────────────── feedback loop ──────────────────────────────┘
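The white paper does not prescribe an implementation for this loop. As an illustrative sketch (the class and method names are my own), a Level 4 system can be reduced to its essence: track per-strategy outcomes and prefer what has worked:

```python
class SelfImprovingAgent:
    """Minimal sketch of the execute → observe → adjust loop.

    Keeps per-strategy success statistics and prefers the strategy with the
    best observed success rate; a production system would learn far richer
    signals. Illustrative only.
    """

    def __init__(self, strategies):
        # strategies: {name: callable(task) -> bool indicating success}
        self.strategies = strategies
        self.stats = {name: {"wins": 0, "tries": 0} for name in strategies}

    def pick_strategy(self):
        # prefer untried strategies, then the highest observed success rate
        def score(name):
            s = self.stats[name]
            return 1.0 if s["tries"] == 0 else s["wins"] / s["tries"]
        return max(self.strategies, key=score)

    def run(self, task):
        name = self.pick_strategy()
        success = self.strategies[name](task)     # execute
        self.stats[name]["tries"] += 1            # observe result
        self.stats[name]["wins"] += int(success)  # adjust future choices
        return name, success
```

The same skeleton generalizes to prompt variants, tool configurations, or planning policies: anything whose outcome can be scored can be fed back into selection.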

3. Core architecture analysis: models, tools and orchestration layers

The white paper details the three core components of Agent. Understanding the relationship between these three is the key to designing the Agent system.

3.1 Model: Agent’s “brain”

The model is responsible for the Agent’s “thinking” process:

Core Responsibilities:

  • Reasoning: Understand the problem, analyze the situation, and make logical deductions
  • Planning: Develop action steps and determine the order of execution
  • Decision: Choosing which tools to use and when to stop
  • Reflection: Evaluate execution results, identify errors and correct them

Model selection matrix:

| Scenario | Recommended model | Reason |
| --- | --- | --- |
| Complex reasoning tasks | GPT-4, Claude-3 Opus | Powerful multi-step reasoning |
| Code generation | Claude-3.5 Sonnet, GPT-4 | Strong code understanding and generation |
| Cost-sensitive scenarios | GPT-3.5, Claude-3 Haiku | High cost-performance ratio |
| Long-context handling | Claude-3 (200K), GPT-4 Turbo (128K) | Large context window |
| Real-time interaction | GPT-4 Turbo | Low latency |

Prompt Engineering Principles:

  1. Clear role definition: Clarify the identity and responsibility boundaries of the Agent
  2. Context Sufficient: Provide sufficient background information and constraints
  3. Output format specification: Use structured output to facilitate subsequent processing
  4. Example guidance: Provide input and output examples to help model understanding
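The four principles can be applied mechanically when composing a system prompt. The helper below is an illustrative sketch; the function and field names are my own, not from the white paper:

```python
def build_system_prompt(role, responsibilities, output_schema, example):
    """Compose a system prompt applying the four principles above.

    All parameter names here are illustrative; adapt them to your agent.
    """
    lines = [
        f"You are {role}.",                                   # 1. clear role definition
        "Responsibilities:",                                  # 2. sufficient context
        *[f"- {r}" for r in responsibilities],
        f"Always answer as JSON matching: {output_schema}",   # 3. output format spec
        f"Example: {example}",                                # 4. example guidance
    ]
    return "\n".join(lines)
```

The point is not the helper itself but the discipline: every Agent system prompt should answer who the agent is, what it may do, what shape its output takes, and what a good answer looks like.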

3.2 Tool: Agent’s “Hand”

Tools enable the Agent to interact with the world and are an extension of the Agent’s capabilities.

Information acquisition tools (Grounding):

# Search engine
search_tool = {
    "name": "web_search",
    "description": "search the internet for real-time information",
    "parameters": {
        "query": "search keywords",
        "num_results": "number of results to return"
    }
}

# Database query
database_tool = {
    "name": "query_database",
    "description": "query the internal database",
    "parameters": {
        "sql": "SQL query statement",
        "database": "target database"
    }
}

# RAG retrieval
rag_tool = {
    "name": "retrieve_knowledge",
    "description": "retrieve relevant information from the knowledge base",
    "parameters": {
        "query": "retrieval query",
        "top_k": "number of results to return"
    }
}

Operation execution tools (Action):

# Send email
email_tool = {
    "name": "send_email",
    "description": "Send email",
    "parameters": {
        "to": "recipient",
        "subject": "subject",
        "body": "content"
    }
}

# Create calendar event
calendar_tool = {
    "name": "create_event",
    "description": "Create calendar event",
    "parameters": {
        "title": "event title",
        "start_time": "start time",
        "end_time": "end time"
    }
}

# API call
generic_api_tool = {
    "name": "call_api",
    "description": "call an external API",
    "parameters": {
        "url": "API endpoint URL",
        "method": "HTTP method",
        "headers": "request headers",
        "body": "request body"
    }
}

Tool Design Best Practices:

  1. Atomicity: each tool does one thing and does it well
  2. Idempotence: the same input produces the same output, making retries safe
  3. Self-descriptive: Tool description should clearly express its purpose and parameters
  4. Error handling: clearly define error types and handling methods
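Idempotence can be approximated with a thin caching wrapper; this is purely illustrative, and real idempotence should be designed into the tool itself (for example via request IDs):

```python
import json

def idempotent(tool_fn):
    """Cache results by canonical input so identical calls return identical
    output, making retries safe. An illustrative sketch, not a library API."""
    cache = {}
    def wrapper(**params):
        key = json.dumps(params, sort_keys=True)  # canonical form of the input
        if key not in cache:
            cache[key] = tool_fn(**params)        # first call does the real work
        return cache[key]                         # repeat calls hit the cache
    return wrapper
```

Within a single agent run, this guarantees that a retried tool call cannot produce a second side effect or a conflicting result.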

3.3 Orchestration Layer: Agent’s “nervous system”

The orchestration layer is the core of the Agent system and is responsible for coordinating the interaction of models and tools.

Core Responsibilities:

  • Conversation state management: Maintain the context of multiple rounds of dialogue
  • Tool Call Orchestration: Handles the input/output transformation of tool calls
  • Context Window Management: Ensures that the context limits of the model are not exceeded
  • Error Handling and Retry: Handle various error situations gracefully
  • Human-machine collaborative handoff: Seek human confirmation when needed

Orchestration layer architecture:

┌─────────────────────────────────────────────────────────┐
│                    Orchestration Layer                  │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Context    │  │    Tool      │  │    Error     │  │
│  │   Manager    │  │   Executor   │  │   Handler    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Session    │  │   Planning   │  │   Human      │  │
│  │    Store     │  │    Engine    │  │   Handoff    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘

Code Example - Complete Orchestrator:

import asyncio
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class Task:
    id: str
    description: str
    status: TaskStatus
    result: Optional[Any] = None
    error: Optional[str] = None

class Orchestrator:
    """Agent orchestrator: coordinates the model, tools, and session state"""

    def __init__(self, llm_client, tools, session_store):
        self.llm = llm_client
        self.tools = {tool["function"]["name"]: tool for tool in tools}
        self.session_store = session_store
        self.max_iterations = 10  # prevent infinite loops

    async def execute(self, user_input: str, session_id: str) -> str:
        """execute a user request"""
        # 1. load session context
        context = await self.load_context(session_id)

        # 2. build message history
        messages = context + [{"role": "user", "content": user_input}]

        # 3. execution loop
        iteration = 0
        while iteration < self.max_iterations:
            iteration += 1

            # call the LLM
            response = await self.call_llm(messages)

            # check whether the task is complete
            if not response.tool_calls:
                # no tool calls; return the result directly
                await self.save_context(session_id, messages + [response])
                return response.content

            # execute tool calls
            tool_results = await self.execute_tools(response.tool_calls)

            # append the result to message history
            messages.extend([response] + tool_results)

            # check whether human intervention is needed
            if await self.needs_human_handoff(messages):
                return await self.request_human_handoff(session_id, messages)

        # maximum iteration count exceeded
        return "The task is taking too long. Please simplify your request or try again later."

    async def call_llm(self, messages: List[Dict]) -> Any:
        """call the LLM"""
        return self.llm.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=list(self.tools.values()),
            tool_choice="auto"
        ).choices[0].message

    async def execute_tools(self, tool_calls: List[Any]) -> List[Dict]:
        """execute tool calls"""
        results = []

        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)

            try:
                # actual tool execution logic
                result = await self.run_tool(function_name, function_args)
                results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps(result)
                })
            except Exception as e:
                results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps({"error": str(e)})
                })

        return results

    async def run_tool(self, name: str, args: Dict) -> Any:
        """dispatch a tool call to its implementation"""
        # dispatch to the concrete implementation by tool name
        tool_implementations = {
            "search_web": self.search_web,
            "get_weather": self.get_weather,
            # ... more tools
        }

        if name in tool_implementations:
            return await tool_implementations[name](**args)
        else:
            raise ValueError(f"Unknown tool: {name}")

    async def needs_human_handoff(self, messages: List[Dict]) -> bool:
        """determine whether human handoff is needed"""
        # implement decision logic, for example:
        # - sensitive operation detected
        # - multiple retries failed
        # - the user explicitly requested human help
        return False

    async def load_context(self, session_id: str) -> List[Dict]:
        """load session context"""
        return await self.session_store.get(session_id, default=[])

    async def save_context(self, session_id: str, messages: List[Dict]):
        """save session context"""
        # truncate overly long context
        truncated = self.truncate_context(messages)
        await self.session_store.set(session_id, truncated)

    def truncate_context(self, messages: List[Dict], max_tokens: int = 8000) -> List[Dict]:
        """truncate the context to stay within token limits"""
        # simplified implementation: keep the system message and recent messages
        system_msgs = [m for m in messages if m.get("role") == "system"]
        other_msgs = [m for m in messages if m.get("role") != "system"]

        # keep recent messages
        keep_count = min(len(other_msgs), 10)  # simplified logic
        return system_msgs + other_msgs[-keep_count:]
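The truncate_context above keeps a fixed message count and ignores its max_tokens parameter. A slightly more faithful sketch budgets by approximate size, assuming roughly 4 characters per token (a crude heuristic; use a real tokenizer such as tiktoken in production):

```python
def truncate_by_token_budget(messages, max_tokens=8000, chars_per_token=4):
    """Keep system messages plus the most recent non-system messages that fit
    an approximate token budget. ~4 chars/token is a rough heuristic, not a
    real tokenizer; this is an illustrative sketch."""
    system_msgs = [m for m in messages if m.get("role") == "system"]
    other_msgs = [m for m in messages if m.get("role") != "system"]

    budget = max_tokens * chars_per_token  # budget measured in characters
    kept = []
    for msg in reversed(other_msgs):       # walk newest → oldest
        cost = len(str(msg.get("content", "")))
        if budget - cost < 0:
            break                          # adding this message would overflow
        budget -= cost
        kept.append(msg)
    return system_msgs + list(reversed(kept))
```

Walking newest-to-oldest means the most recent turns always survive truncation, which matters far more to the next model call than old history.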

4. Key Considerations in Design Selection

4.1 Domain knowledge and role setting

It is crucial to set a clear persona and domain knowledge for your Agent:

Character setting template:

You are a professional {role}, specialized in {domain}.

Your responsibilities:
1. {responsibility1}
2. {responsibility2}
3. {responsibility3}

Your behavior guidelines:
- {guideline1}
- {guideline2}
- {guideline3}

Your constraints:
- Must not {constraint1}
- Must not {constraint2}

Example - Medical Consultation Agent:

You are a medical information assistant.

Your responsibilities:
1. Understand the user's symptoms and background.
2. Provide general health information and next-step suggestions.
3. Ask clarifying questions when the user's situation is unclear.

Your behavior guidelines:
- Explain professional concepts in plain language.
- Encourage seeking professional care when risk signals appear.

Your constraints:
- you must not provide a diagnosis; advise the user to consult a qualified physician
- you must not recommend prescription medication or treatment plans
- you must not handle medical emergencies; advise the user to call emergency services

4.2 Context enhancement strategy

Short Term Memory Management:

class ShortTermMemory:
    def __init__(self, max_messages=20):
        self.messages = []
        self.max_messages = max_messages

    def add(self, message):
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        return self.messages

    def clear(self):
        self.messages = []

Long term memory implementation:

import hashlib
import time

class LongTermMemory:
    def __init__(self, vector_store):
        self.vector_store = vector_store

    async def remember(self, user_id: str, key: str, value: str, importance: float = 1.0):
        """store long-term memory"""
        memory_id = hashlib.md5(f"{user_id}:{key}".encode()).hexdigest()

        await self.vector_store.upsert(
            ids=[memory_id],
            documents=[value],
            metadatas=[{
                "user_id": user_id,
                "key": key,
                "importance": importance,
                "timestamp": time.time()
            }]
        )

    async def recall(self, user_id: str, query: str, top_k: int = 5):
        """retrieve relevant memory"""
        results = await self.vector_store.query(
            query_texts=[query],
            where={"user_id": user_id},
            n_results=top_k
        )
        return results

4.3 Multi-Agent system design pattern

The white paper mentions several common multi-agent architectures:

1. Master-Worker:

Master Agent
    ├─ Worker 1 (data processing)
    ├─ Worker 2 (analysis)
    └─ Worker 3 (generation)

Applicable scenarios: Tasks can be clearly decomposed into subtasks

2. Pipeline:

Input → Agent A → Agent B → Agent C → Output
       (collect)  (analyze)  (generate)

Applicable scenarios: fixed data-processing pipelines

3. Competitive:

Input ┬→ Agent A ─┐
      ├→ Agent B ─┼→ Selector → Best Output
      └→ Agent C ─┘

Applicable scenarios: Scenarios that require multiple options to choose the best

4. Collaborative:

Agent A ←→ Agent B
   ↕          ↕
Agent C ←→ Agent D

Applicable scenarios: Complex issues require multi-party negotiation
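The competitive pattern (3) is the easiest to sketch in code. The function name and judge interface below are my own, purely illustrative:

```python
def competitive_run(agents, task, judge):
    """Run every candidate agent on the same task and let a judge pick the
    best output. `agents` maps name → callable(task); `judge` maps an output
    to a numeric score. An illustrative sketch of the competitive pattern."""
    candidates = [(name, agent(task)) for name, agent in agents.items()]
    return max(candidates, key=lambda pair: judge(pair[1]))  # highest score wins
```

In practice the candidates would be different models, prompts, or temperatures, and the judge is often another LLM call scoring against a rubric rather than a simple function.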


5. Agent operation and maintenance: structured response to uncertainty

Unlike traditional software, Agent’s behavior is non-deterministic. The white paper proposes the concept of “Agent Ops”.

5.1 Measuring key indicators

Core indicator system:

| Indicator category | Specific indicator | Calculation | Target |
| --- | --- | --- | --- |
| Task completion | Success rate | successful tasks / total tasks | > 90% |
| Efficiency | Average steps | total steps / completed tasks | < 5 steps |
| Cost | Cost per task | token cost + API fees | < $0.1 |
| Latency | Average response time | total time / request count | < 3s |
| Quality | User satisfaction | average rating | > 4.0/5 |

Indicator monitoring code example:

from dataclasses import dataclass, field
from typing import List
import time

@dataclass
class AgentMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    total_steps: int = 0
    total_latency: float = 0.0
    total_cost: float = 0.0
    user_ratings: List[float] = field(default_factory=list)

    @property
    def success_rate(self):
        if self.total_requests == 0:
            return 0
        return self.successful_requests / self.total_requests

    @property
    def avg_steps(self):
        if self.successful_requests == 0:
            return 0
        return self.total_steps / self.successful_requests

    @property
    def avg_latency(self):
        if self.total_requests == 0:
            return 0
        return self.total_latency / self.total_requests

    @property
    def avg_cost(self):
        if self.total_requests == 0:
            return 0
        return self.total_cost / self.total_requests

    @property
    def avg_rating(self):
        if not self.user_ratings:
            return 0
        return sum(self.user_ratings) / len(self.user_ratings)

    def report(self):
        return f"""
        Agent performance report
        ========================
        Requests:      {self.total_requests}
        Success rate:  {self.success_rate:.2%}
        Avg steps:     {self.avg_steps:.1f}
        Avg latency:   {self.avg_latency:.2f}s
        Cost per task: ${self.avg_cost:.4f}
        User rating:   {self.avg_rating:.2f}/5
        """

5.2 A/B testing thinking

Think of the Agent as a system that requires continuous experimentation:

A/B Testing Framework:

import hashlib

class ABTestFramework:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, name, variants, traffic_split):
        """
        Create an experiment.
        variants: {'control': control_config, 'treatment': treatment_config}
        traffic_split: {'control': 0.5, 'treatment': 0.5}
        """
        self.experiments[name] = {
            'variants': variants,
            'traffic_split': traffic_split,
            'metrics': {'control': [], 'treatment': []}
        }

    def get_variant(self, experiment_name, user_id):
        """Deterministically assign a user to a variant"""
        exp = self.experiments[experiment_name]

        # Hash the user ID so the same user always sees the same variant.
        # hashlib (not the built-in hash(), which varies between runs)
        # keeps the bucket stable across processes.
        digest = hashlib.md5(f"{experiment_name}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100

        cumulative = 0
        for variant, split in exp['traffic_split'].items():
            cumulative += split * 100
            if hash_val < cumulative:
                return variant, exp['variants'][variant]

        return 'control', exp['variants']['control']

    def record_metric(self, experiment_name, variant, metric_value):
        """record experiment metrics"""
        self.experiments[experiment_name]['metrics'][variant].append(metric_value)

    def analyze_results(self, experiment_name):
        """analyze experiment results"""
        metrics = self.experiments[experiment_name]['metrics']

        control_metrics = metrics['control']
        treatment_metrics = metrics['treatment']

        # calculate statistical significance
        from scipy import stats
        t_stat, p_value = stats.ttest_ind(control_metrics, treatment_metrics)

        control_mean = sum(control_metrics) / len(control_metrics)
        treatment_mean = sum(treatment_metrics) / len(treatment_metrics)
        lift = (treatment_mean - control_mean) / control_mean

        return {
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'lift': lift,
            'p_value': p_value,
            'significant': p_value < 0.05
        }
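The key property of the assignment step is determinism: the same user must land in the same bucket on every request. That bucketing can be isolated into a standalone function (using `hashlib` for a hash that is stable across processes, unlike Python's built-in `hash()`):

```python
import hashlib

def assign_variant(experiment: str, user_id: str,
                   traffic_split: dict[str, float]) -> str:
    # Stable 0-99 bucket per (experiment, user) pair.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    cumulative = 0.0
    for variant, split in traffic_split.items():
        cumulative += split * 100
        if bucket < cumulative:
            return variant
    # Fallback for floating-point rounding gaps: first variant.
    return next(iter(traffic_split))

split = {"control": 0.5, "treatment": 0.5}
# Re-running the process gives identical assignments.
assert assign_variant("new_prompt", "user_42", split) == \
       assign_variant("new_prompt", "user_42", split)
```

Deterministic bucketing also means assignments need no storage: any service can recompute a user's variant from the experiment name and user ID alone.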

5.3 LLM as judge

Use another LLM to evaluate the agent’s output quality:

import json

class LLMJudge:
    def __init__(self, judge_model="gpt-4"):
        self.judge_model = judge_model

    async def evaluate(self, task_description, agent_output, criteria):
        """Agent output"""
        prompt = f"""
        You are an impartial judge. Evaluate the quality of the following Agent output.

        Task description:
        {task_description}

        Agent output:
        {agent_output}

        Evaluation criteria:
        {criteria}

        Return the evaluation result in the following JSON format:
        {{
            "overall_score": 1-5,
            "dimension_scores": {{
                "accuracy": 1-5,
                "completeness": 1-5,
                "clarity": 1-5,
                "usefulness": 1-5
            }},
            "reasoning": "descriptionscore",
            "strengths": ["pros1", "pros2"],
            "weaknesses": ["insufficient1", "insufficient2"],
            "suggestions": ["suggestions1", "suggestions2"]
        }}
        """

        response = await self.call_llm(prompt)
        return json.loads(response)

    async def batch_evaluate(self, test_cases):
        """batch evaluation"""
        results = []
        for case in test_cases:
            result = await self.evaluate(
                case['task'],
                case['output'],
                case['criteria']
            )
            results.append(result)
        return results

5.4 Observability

Use OpenTelemetry to track the execution path of the Agent:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# configure the tracer provider and the OTLP exporter
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(endpoint="your-collector-endpoint")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

class ObservableAgent:
    @tracer.start_as_current_span("agent_execution")
    async def execute(self, user_input):
        with tracer.start_as_current_span("intent_classification") as span:
            intent = await self.classify_intent(user_input)
            span.set_attribute("intent", intent)

        with tracer.start_as_current_span("tool_execution") as span:
            tools_used = []
            for tool_call in self.plan_tools(intent):
                with tracer.start_as_current_span(f"tool_{tool_call.name}"):
                    result = await self.execute_tool(tool_call)
                    tools_used.append(tool_call.name)
            span.set_attribute("tools_used", tools_used)

        with tracer.start_as_current_span("response_generation"):
            response = await self.generate_response()

        return response

6. Agent interoperability: connecting people, agents and businesses

The white paper concludes by exploring the interoperability of the Agent ecosystem.

6.1 Agent and people

Human-machine collaboration mode:

  1. Human-in-the-loop:

    • Agent requests confirmation before execution
    • Manual review of key decision points
    • Confirmation of complex tasks in stages
  2. Human-on-the-loop:

    • Agent executes autonomously
    • Human monitoring execution process
    • Manual intervention in abnormal situations
  3. Human-in-command:

    • Humans set high-level goals
    • Agent autonomous planning and execution
    • Report progress regularly

Handover design principles:

class HumanHandoff:
    def __init__(self):
        self.handoff_triggers = [
            self.check_sensitive_operation,
            self.check_confidence_threshold,
            self.check_user_frustration,
            self.check_repeated_failures
        ]

    async def should_handoff(self, context) -> bool:
        """determine whether human handoff is needed"""
        for trigger in self.handoff_triggers:
            if await trigger(context):
                return True
        return False

    async def handoff(self, context, reason):
        """Execute"""
        handoff_context = {
            "conversation_history": context.messages,
            "pending_tasks": context.pending_tasks,
            "reason": reason,
            "suggested_action": await self.suggest_action(context)
        }

        # notify a human agent and pass along the context
        await self.notify_human_agent(handoff_context)

        return "Transferring you to a human agent, please wait..."

6.2 Agent and Agent

Cross-Agent Communication Protocol:

class AgentCommunicationProtocol:
    """Agent"""

    async def send_message(self, from_agent, to_agent, message_type, payload):
        """message"""
        message = {
            "from": from_agent,
            "to": to_agent,
            "type": message_type,  # request, response, broadcast
            "payload": payload,
            "timestamp": time.time(),
            "message_id": generate_id()
        }

        await self.message_bus.send(message)

    async def request_capability(self, agent_id, capability_requirement):
        """requestother Agent capability"""
        available_agents = await self.discover_agents(capability_requirement)

        if not available_agents:
            raise NoAgentAvailable(capability_requirement)

        # select the best-suited Agent among the candidates
        selected = self.select_best_agent(available_agents)

        # send the request and wait for the response
        response = await self.send_request(selected, capability_requirement)
        return response
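A minimal in-process transport behind `self.message_bus` could be one `asyncio.Queue` per recipient. This is a sketch for illustration only; a production system would use a real broker (Kafka, NATS, etc.):

```python
import asyncio
import time
import uuid
from collections import defaultdict

class MessageBus:
    """Hypothetical in-process message bus: one inbox queue per Agent."""
    def __init__(self):
        self.queues = defaultdict(asyncio.Queue)  # agent_id -> inbox

    async def send(self, message: dict):
        # Route by the "to" field of the envelope.
        await self.queues[message["to"]].put(message)

    async def receive(self, agent_id: str) -> dict:
        return await self.queues[agent_id].get()

async def main():
    bus = MessageBus()
    await bus.send({
        "from": "planner", "to": "executor", "type": "request",
        "payload": {"task": "fetch prices"},
        "timestamp": time.time(), "message_id": str(uuid.uuid4()),
    })
    msg = await bus.receive("executor")
    print(msg["type"], msg["payload"]["task"])  # → request fetch prices

asyncio.run(main())
```

Because every message carries `from`, `to`, `type`, and `message_id`, the same envelope works unchanged whether the bus is an in-process queue or a networked broker.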

6.3 Agent and business

Agent Economic Model:

  1. Pay-per-use:

    • Billed per API call
    • Billed based on Token consumption
    • Billing based on task completion
  2. Subscription:

    • Basic / Professional / Enterprise tiers
    • Features gated by tier
    • Usage quotas
  3. Outcome-based:

    • Billing based on business results
    • Pay only if you succeed
    • Risk sharing
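For the pay-per-use model, the cost of one task is simply token usage times unit price plus any per-call tool/API fees. A minimal sketch (the prices below are illustrative defaults, not real list prices):

```python
def task_cost(input_tokens: int, output_tokens: int, api_calls: int,
              price_in_per_1k: float = 0.001,     # illustrative prices
              price_out_per_1k: float = 0.002,
              price_per_api_call: float = 0.0005) -> float:
    """Cost of one task: token fees plus per-call API fees."""
    token_fee = ((input_tokens / 1000) * price_in_per_1k
                 + (output_tokens / 1000) * price_out_per_1k)
    return round(token_fee + api_calls * price_per_api_call, 6)

# A task consuming 3k input + 1k output tokens and 4 tool calls:
print(task_cost(3000, 1000, 4))  # → 0.007
```

Tracking this per task is what makes the `< $0.10` cost target in the metrics table above enforceable rather than a guess.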

7. Practical Enlightenment

7.1 Getting started suggestions

Progressive evolution path:

Phase 1 (1-2 weeks): Level 1 Basic Agent

  • Implement single Agent + 3-5 basic tools
  • Complete a specific scenario (such as Q&A assistant)
  • Establish basic monitoring

Phase 2 (1 month): Level 2 Planning Agent

  • Add task decomposition capability
  • Implement multi-step task execution
  • Establish an evaluation system

Phase 3 (2-3 months): Level 3 multi-agent system

  • Designing a multi-agent architecture
  • Implement collaboration between agents
  • Improve the operation and maintenance system

Phase 4 (6 months+): Level 4 self-evolving system

  • Add learning capabilities
  • Achieve self-optimization
  • Establish a closed feedback loop

7.2 Common pitfalls

Trap 1: Over-Engineering

  • Symptom: Designing complex multi-agent architectures for simple tasks
  • Solution: Start simple and evolve as needed

Trap 2: Neglecting Assessment

  • Symptom: After going online, it is found that the quality is not up to standard
  • Solution: Establish an evaluation system and data-driven iteration

Trap 3: Hardcoded prompts

  • Symptom: Prompt templates scattered throughout the code
  • Solution: Use a prompt management system

Trap 4: Lack of Security Considerations

  • Symptom: Agent performs dangerous operations
  • Solution: Establish permission control and audit mechanism

Trap 5: Ignoring user experience

  • Symptom: Agent is powerful but difficult to use
  • Solution: Continue to collect user feedback

7.3 Tool chain recommendation

| Category              | Tool                          | Use                       |
|-----------------------|-------------------------------|---------------------------|
| Development framework | LangChain, LlamaIndex         | Agent development         |
| Prompt management     | PromptLayer, Weights & Biases | Prompt version management |
| Evaluation & testing  | OpenAI Evals, Promptflow      | Quality assessment        |
| Observability         | LangSmith, Langfuse           | Monitoring and tracing    |
| Deployment            | Docker, Kubernetes            | Production deployment     |

8. Future Outlook

Agent technology is developing rapidly and we can expect:

  1. More powerful reasoning models: models such as o1 will significantly improve Agents’ planning capabilities
  2. Standardized Agent protocols: protocols such as MCP will become industry standards
  3. Native Agent support: operating systems and browsers will support Agents natively
  4. Agent marketplaces: an Agent marketplace similar to the App Store will emerge
  5. New modes of human-machine collaboration: Agents will become standard human “colleagues”

Reference and further reading

  • Original text: Kaggle Introduction to Agents Whitepaper
  • Authors: Alan Blount, Antonio Gulli, Shubham Saboo, Michael Zimmermann, Vladimir Vuskovic
  • Release date: November 2025
  • Related projects: LangChain, AutoGPT, MetaGPT

*This article is licensed under CC BY-NC-SA 4.0. Please credit the source when reprinting.*
