Hualin Luan Cloud Native · Quant Trading · AI Engineering


Original interpretation: Kaggle white paper "Introduction to Agents" - AI Agent introduction and architecture panorama

An in-depth analysis of the Agent's five levels, core architecture, and production practices, organizing the key frameworks and takeaways from the Kaggle white paper "Introduction to Agents"

Published: 3/12/2026
Category: interpretation
Reading time: 25 min read

📋 Copyright Statement and Disclaimer

This article is the author’s original analysis based on the Kaggle white paper “Introduction to Agents”.

Opinion Attribution Statement:

  • The analysis of the five levels of Agent, architecture evolution and engineering practice in this article is the author’s independent organization and reinterpretation based on the white paper
  • The writing structure, case organization and key judgments reflect the author’s personal understanding
  • The article is not a paragraph-by-paragraph translation of the white paper, but an original interpretation oriented to engineering practice.

Original reference:

  • Title: “Introduction to Agents”
  • Authors: Alan Blount, Antonio Gulli, Shubham Saboo, Michael Zimmermann, Vladimir Vuskovic
  • Link: Read original text

Original nature: This article is an independently created interpretation article, not a translation or rewriting. The views expressed in this article represent only the author’s personal understanding and may differ from the original author’s position.


Introduction: From a model that can answer questions to an Agent that can complete tasks

In the past few years, the most striking capabilities of large models have centered on “answering questions” and “generating content”: they can write code, summarize, draft copy, and in many scenarios give near-expert explanations. But once we shift perspective from “generating an answer” to “accomplishing a goal”, the boundary of a traditional LLM becomes obvious: it can offer suggestions, but it cannot by itself plan, call tools, decompose tasks, and execute all the way to a result.

The real value of the Kaggle white paper “Introduction to Agents” is that it states this difference plainly: **an Agent is not simply an LLM plugged into a workflow, but a complete application system that can continuously decide and act in pursuit of a goal.**

This article does not retell the white paper paragraph by paragraph. Instead, it reorganizes the parts most worth your attention from an engineering-practice perspective: Why is the Agent a paradigm shift? How should we understand its five levels? How does architectural complexity grow step by step, from basic reasoning and tool connection to multi-step planning and system collaboration? And why should these concepts directly shape how you build AI products?


1. From predictive AI to autonomous agents: a paradigm shift

In the past few years, AI development has centered on passive, discrete-task models: translate a passage, answer a question, generate an image from a prompt. Powerful as this paradigm is, it requires explicit human instruction at every step.

We are now witnessing a paradigm shift from AI that can only predict or create content to new types of software that can solve problems and perform tasks autonomously.

This is the core idea of the AI Agent.

1.1 Paradigm comparison: from passive to active

Let’s understand this difference through a specific scenario:

Traditional AI (predictive):

User: What will the weather be in Beijing tomorrow?
AI: Beijing will be cloudy tomorrow, with temperatures from 15 to 22 degrees.
(conversation ends; wait for the next instruction)

Agent (autonomous):

User: I want to visit Beijing this weekend. Any suggestions?
Agent:
1. check Beijing weekend weather forecast
2. search popular Beijing weekend attractions
3. recommend outdoor activities based on the weather
4. check traffic information and recommend transport options
5. combine the information into a complete itinerary

Here is a complete recommendation: because Saturday is sunny, visit the Great Wall in the morning and the Summer Palace in the afternoon...

Agents don’t just answer questions, but actively plan and execute a series of actions to achieve goals.

1.2 What exactly is Agent?

Many people mistakenly believe that an Agent is just an LLM embedded in a workflow. But in the 54-page Kaggle white paper, the authors give a more precise definition:

The Agent is not an AI model in a static workflow, but a complete application system - one that can make plans and take actions to achieve goals.

Agents combine the reasoning capabilities of language models with practical action capabilities, enabling them to handle complex, multi-step tasks that cannot be accomplished by the model itself.

The key capability: an Agent can work autonomously, figuring out for itself the steps required to reach the goal, without human guidance at every step.

1.3 Why is Agent important now?

There are three key factors driving the rise of Agent:

  1. Leap in model capabilities: Large models such as GPT-4 and Claude have powerful reasoning and planning capabilities
  2. Mature tool ecosystem: Protocols such as Function Calling and MCP standardize the connection between models and the external world.
  3. Actual demand driven: Enterprises need AI systems that can automate complex business processes

2. Five levels of Agent: from reasoning to self-evolution

The white paper proposes a valuable taxonomy that divides Agent systems into five levels. The classification not only helps us understand the Agent's evolution path but also provides a clear reference framework for architecture design.

Level 0: Core Reasoning System

This is the base layer and contains only the language model itself. It can reason and answer questions, but has no ability to interact with the outside world.

Capability Boundary:

  • Can answer knowledge questions
  • Able to reason logically
  • Can generate text and code
  • Unable to obtain real-time information
  • Unable to perform external operation

Typical Application:

  • Internal knowledge Q&A (based on training data)
  • Text generation and polishing
  • Simple logical reasoning tasks

Architecture Example:

user input → LLM → model output

Code Example:

from openai import OpenAI

client = OpenAI()

class Level0Agent:
    def __init__(self):
        self.system_prompt = "You are a helpful assistant."

    def chat(self, user_message):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message}
            ]
        )
        return response.choices[0].message.content

Level 1: Connected Problem-Solver

Based on Level 0, tool calling capability is added. Agents can search web pages, query databases, and call APIs to obtain information or perform actions.

Core Improvements:

  • Ability to access real-time information
  • Can call external API
  • Have basic data acquisition capabilities

Typical Application:

  • Q&A assistant with search function
  • Weather query robot
  • Stock information query

Architecture Example:

user input → LLM → need tool?
                    ├─ Yes → call tool → merge results → output
                    └─ No → direct output

Code Example:

import json
from openai import OpenAI

client = OpenAI()

class Level1Agent:
    def __init__(self):
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "search_web",
                    "description": "search the web for real-time information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        },
                        "required": ["query"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "get weather for the specified city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string"},
                            "date": {"type": "string"}
                        },
                        "required": ["city"]
                    }
                }
            }
        ]

    def execute_tool(self, tool_name, params):
        if tool_name == "get_weather":
            # a real implementation would call a weather API
            return {"temperature": 22, "condition": "sunny"}
        if tool_name == "search_web":
            # a real implementation would call a search API
            return {"results": []}
        return {}

    def chat(self, user_message):
        messages = [{"role": "user", "content": user_message}]

        # first call: request tool calls
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=self.tools,
            tool_choice="auto"
        )

        message = response.choices[0].message

        # check whether tool calls are needed
        if message.tool_calls:
            # execute tool calls
            tool_results = []
            for tool_call in message.tool_calls:
                function_name = tool_call.function.name
                function_params = json.loads(tool_call.function.arguments)
                result = self.execute_tool(function_name, function_params)
                tool_results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps(result)
                })

            # append tool results to the conversation
            messages.append(message)
            messages.extend(tool_results)

            # second call: get the final response
            final_response = client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
            return final_response.choices[0].message.content

        return message.content

Level 2: Strategic Problem-Solver

This layer introduces planning capabilities and multi-step reasoning. Agent can decompose complex tasks into subtasks and execute them in strategic order.

Core Improvements:

  • Task decomposition
  • Multi-step planning
  • Error recovery mechanism
  • Processing and integration of intermediate results

Typical Application:

  • Research report generation (search → analysis → writing)
  • Data analysis tasks (obtaining data → cleaning → analysis → visualization)
  • Complex problem diagnosis

Architecture Example:

User goal → Task decomposer ─┬→ Subtask 1 → Execute → result
                             ├→ Subtask 2 → Execute → result
                             └→ Subtask 3 → Execute → result
                                            ↓
                              Result combiner → Final output

Code Example:

class Level2Agent:
    def __init__(self):
        self.level1_agent = Level1Agent()

    def decompose_task(self, goal):
        """decompose a complex task into subtasks"""
        decomposition_prompt = f"""
        decompose the following goal into executable steps:
        Goal: {goal}

        Requirements:
        1. each step should be executable
        2. dependencies between steps must be explicit
        3. output format is a JSON array

        Example output:
        [
            {{"step": 1, "action": "search related material", "depends_on": []}},
            {{"step": 2, "action": "analyze collected information", "depends_on": [1]}},
            {{"step": 3, "action": "generate a report", "depends_on": [2]}}
        ]
        """

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": decomposition_prompt}]
        )

        steps = json.loads(response.choices[0].message.content)
        return steps

    def execute_task(self, goal):
        """execute complex task"""
        # 1. task decomposition
        steps = self.decompose_task(goal)

        # 2. execute in dependency order
        results = {}
        for step in steps:
            step_num = step["step"]
            action = step["action"]

            # build context, including previous step results
            context = f"Task: {action}\n"
            if step.get("depends_on"):
                for dep in step["depends_on"]:
                    context += f"step{dep} result: {results.get(dep, '')}\n"

            # execute step
            result = self.level1_agent.chat(context)
            results[step_num] = result

            print(f"step {step_num} complete: {action}")

        # 3. merge results
        final_result = self.synthesize_results(goal, results)
        return final_result

    def synthesize_results(self, goal, results):
        """merge results from all steps"""
        synthesis_prompt = f"""
        Goal: {goal}

        Step execution results:
        {json.dumps(results, ensure_ascii=False, indent=2)}

        Merge the results above into the final output.
        """

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": synthesis_prompt}]
        )

        return response.choices[0].message.content
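The Level2Agent above omits the "error recovery mechanism" listed under core improvements. A minimal sketch of retry-with-feedback, where the failure message is fed back into the next attempt, could look like this (the helper and its names are illustrative, not from the white paper):

```python
import time

def execute_with_recovery(step_fn, max_retries=2, base_delay=1.0):
    """Run one step; on failure, retry and pass the error back as feedback.

    `step_fn` is any callable taking an optional feedback string and raising
    an exception on failure. Purely an illustrative sketch.
    """
    feedback = None
    for attempt in range(max_retries + 1):
        try:
            return step_fn(feedback)
        except Exception as e:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            feedback = f"previous attempt failed: {e}"  # fed into the next try
            time.sleep(base_delay * (2 ** attempt))     # exponential backoff
```

In a real agent, `step_fn` would wrap the LLM call for one subtask, so the model sees what went wrong on the previous attempt and can adjust.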

Level 3: Collaborative Multi-Agent System

Multiple Agents work together, and each Agent may be specialized in a specific field. They can work together, communicate with each other, and coordinate complex parallel tasks.

Core Improvements:

  • Specialized division of labor
  • Communication between agents
  • Parallel execution
  • Results coordination and integration

Analogy: Just like a software development team, there are product managers, back-end engineers, front-end engineers, and testers, each responsible for different aspects.

Typical Application:

  • Software development assistant (requirements analysis → design → coding → testing)
  • Investment research team (market analysis → financial analysis → risk assessment → investment advice)
  • Content creation team (planning → writing → editing → design)

Architecture Example:

User request → Coordinator Agent ─┬→ Research Agent → data
                                  ├→ Analysis Agent → insight
                                  ├→ Writing Agent  → content
                                  └→ Review Agent   → quality check
                                            ↓
                                    Integrated output

Code Example:

class SpecializedAgent:
    """specialized Agent base class"""
    def __init__(self, name, expertise):
        self.name = name
        self.expertise = expertise

    def process(self, task, context=None):
        prompt = f"""
        You are {self.name}, specialized in {self.expertise}.

        Task: {task}
        {f"Context: {context}" if context else ""}

        Complete this task based on your area of expertise.
        """

        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )

        return {
            "agent": self.name,
            "result": response.choices[0].message.content
        }

class Level3MultiAgent:
    def __init__(self):
        self.coordinator = SpecializedAgent("Coordinator", "task decomposition and coordination")
        self.researcher = SpecializedAgent("Researcher", "information collection and research")
        self.analyst = SpecializedAgent("Analyst", "data analysis and insight")
        self.writer = SpecializedAgent("Writer", "content creation and writing")

    def execute(self, goal):
        """multi-agent collaborative execution"""
        # 1. the coordinator decomposes and assigns the task
        coordination_prompt = f"""
        Goal: {goal}

        Available agents:
        - Researcher: responsible for information collection
        - Analyst: responsible for data analysis
        - Writer: responsible for content creation

        Create a collaboration plan with each agent task, input, and output.
        """

        plan_response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": coordination_prompt}]
        )
        plan = plan_response.choices[0].message.content

        # 2. parallel execution, simplified example
        research_result = self.researcher.process(f"research topic: {goal}")

        # the analyst starts after the researcher finishes
        analysis_result = self.analyst.process(
            f"analyze the following research result: {research_result['result']}",
            context=research_result['result']
        )

        # the writer integrates the final output
        final_result = self.writer.process(
            f"write the final report based on this analysis: {analysis_result['result']}",
            context=f"research: {research_result['result']}\nanalysis: {analysis_result['result']}"
        )

        return {
            "plan": plan,
            "research": research_result,
            "analysis": analysis_result,
            "final": final_result
        }

Level 4: Self-Evolving System

At the highest level, agents can learn from experience, self-optimize strategies, improve their own tool usage, and even modify their own system configurations.

Core Improvements:

  • Learning from experience
  • Self-optimizing strategies
  • Improving tool usage
  • Adapting system configuration

This represents the ultimate vision of Agent technology—a truly autonomous system capable of continuous improvement.

Typical Application:

  • Self-optimizing customer service system
  • Adaptive Trading Strategies
  • Continuous learning research assistant

Core Mechanism:

Execute → observe result → analyze performance → identify improvements → adjust strategy
   ↑                                                                          │
   └───────────────────────────── feedback loop ──────────────────────────────┘
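The white paper does not prescribe an implementation for this loop. As an illustrative sketch (the class and method names are my own), a Level 4 system can be reduced to its essence: track per-strategy outcomes and prefer what has worked:

```python
class SelfImprovingAgent:
    """Minimal sketch of the execute → observe → adjust loop.

    Keeps per-strategy success statistics and prefers the strategy with the
    best observed success rate; a production system would learn far richer
    signals. Illustrative only.
    """

    def __init__(self, strategies):
        # strategies: {name: callable(task) -> bool indicating success}
        self.strategies = strategies
        self.stats = {name: {"wins": 0, "tries": 0} for name in strategies}

    def pick_strategy(self):
        # prefer untried strategies, then the highest observed success rate
        def score(name):
            s = self.stats[name]
            return 1.0 if s["tries"] == 0 else s["wins"] / s["tries"]
        return max(self.strategies, key=score)

    def run(self, task):
        name = self.pick_strategy()
        success = self.strategies[name](task)     # execute
        self.stats[name]["tries"] += 1            # observe result
        self.stats[name]["wins"] += int(success)  # adjust future choices
        return name, success
```

The same skeleton generalizes to prompt variants, tool configurations, or planning policies: anything whose outcome can be scored can be fed back into selection.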

3. Core architecture analysis: models, tools and orchestration layers

The white paper details the three core components of Agent. Understanding the relationship between these three is the key to designing the Agent system.

3.1 Model: Agent’s “brain”

The model is responsible for the Agent’s “thinking” process:

Core Responsibilities:

  • Reasoning: Understand the problem, analyze the situation, and make logical deductions
  • Planning: Develop action steps and determine the order of execution
  • Decision: Choosing which tools to use and when to stop
  • Reflection: Evaluate execution results, identify errors and correct them

Model selection matrix:

| Scenario | Recommended model | Reason |
| --- | --- | --- |
| Complex reasoning tasks | GPT-4, Claude-3 Opus | Powerful multi-step reasoning |
| Code generation | Claude-3.5 Sonnet, GPT-4 | Strong code understanding and generation |
| Cost-sensitive scenarios | GPT-3.5, Claude-3 Haiku | High cost-performance ratio |
| Long-context handling | Claude-3 (200K), GPT-4 Turbo (128K) | Large context window |
| Real-time interaction | GPT-4 Turbo | Low latency |

Prompt Engineering Principles:

  1. Clear role definition: Clarify the identity and responsibility boundaries of the Agent
  2. Context Sufficient: Provide sufficient background information and constraints
  3. Output format specification: Use structured output to facilitate subsequent processing
  4. Example guidance: Provide input and output examples to help model understanding
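The four principles can be applied mechanically when composing a system prompt. The helper below is an illustrative sketch; the function and field names are my own, not from the white paper:

```python
def build_system_prompt(role, responsibilities, output_schema, example):
    """Compose a system prompt applying the four principles above.

    All parameter names here are illustrative; adapt them to your agent.
    """
    lines = [
        f"You are {role}.",                                   # 1. clear role definition
        "Responsibilities:",                                  # 2. sufficient context
        *[f"- {r}" for r in responsibilities],
        f"Always answer as JSON matching: {output_schema}",   # 3. output format spec
        f"Example: {example}",                                # 4. example guidance
    ]
    return "\n".join(lines)
```

The point is not the helper itself but the discipline: every Agent system prompt should answer who the agent is, what it may do, what shape its output takes, and what a good answer looks like.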

3.2 Tool: Agent’s “Hand”

Tools enable the Agent to interact with the world and are an extension of the Agent’s capabilities.

Information acquisition tools (Grounding):

# Search engine
search_tool = {
    "name": "web_search",
    "description": "search the internet for real-time information",
    "parameters": {
        "query": "search keywords",
        "num_results": "number of results to return"
    }
}

# Database query
database_tool = {
    "name": "query_database",
    "description": "query the internal database",
    "parameters": {
        "sql": "SQL query statement",
        "database": "target database"
    }
}

# RAG retrieval
rag_tool = {
    "name": "retrieve_knowledge",
    "description": "retrieve relevant information from the knowledge base",
    "parameters": {
        "query": "retrieval query",
        "top_k": "number of results to return"
    }
}

Operation execution tools (Action):

# Send email
email_tool = {
    "name": "send_email",
    "description": "Send email",
    "parameters": {
        "to": "recipient",
        "subject": "subject",
        "body": "content"
    }
}

# Create calendar event
calendar_tool = {
    "name": "create_event",
    "description": "Create calendar event",
    "parameters": {
        "title": "event title",
        "start_time": "start time",
        "end_time": "end time"
    }
}

# API call
generic_api_tool = {
    "name": "call_api",
    "description": "call an external API",
    "parameters": {
        "url": "API endpoint URL",
        "method": "HTTP method",
        "headers": "request headers",
        "body": "request body"
    }
}

Tool Design Best Practices:

  1. Atomicity: each tool does one thing and does it well
  2. Idempotence: the same input produces the same output, making retries safe
  3. Self-descriptive: Tool description should clearly express its purpose and parameters
  4. Error handling: clearly define error types and handling methods
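Idempotence can be approximated with a thin caching wrapper; this is purely illustrative, and real idempotence should be designed into the tool itself (for example via request IDs):

```python
import json

def idempotent(tool_fn):
    """Cache results by canonical input so identical calls return identical
    output, making retries safe. An illustrative sketch, not a library API."""
    cache = {}
    def wrapper(**params):
        key = json.dumps(params, sort_keys=True)  # canonical form of the input
        if key not in cache:
            cache[key] = tool_fn(**params)        # first call does the real work
        return cache[key]                         # repeat calls hit the cache
    return wrapper
```

Within a single agent run, this guarantees that a retried tool call cannot produce a second side effect or a conflicting result.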

3.3 Orchestration Layer: Agent’s “nervous system”

The orchestration layer is the core of the Agent system and is responsible for coordinating the interaction of models and tools.

Core Responsibilities:

  • Conversation state management: Maintain the context of multiple rounds of dialogue
  • Tool Call Orchestration: Handles the input/output transformation of tool calls
  • Context Window Management: Ensures that the context limits of the model are not exceeded
  • Error Handling and Retry: Handle various error situations gracefully
  • Human-machine collaborative handoff: Seek human confirmation when needed

Orchestration layer architecture:

┌─────────────────────────────────────────────────────────┐
│                    Orchestration Layer                  │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Context    │  │    Tool      │  │    Error     │  │
│  │   Manager    │  │   Executor   │  │   Handler    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │
│  │   Session    │  │   Planning   │  │   Human      │  │
│  │    Store     │  │    Engine    │  │   Handoff    │  │
│  └──────────────┘  └──────────────┘  └──────────────┘  │
└─────────────────────────────────────────────────────────┘

Code Example - Complete Orchestrator:

import asyncio
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class Task:
    id: str
    description: str
    status: TaskStatus
    result: Optional[Any] = None
    error: Optional[str] = None

class Orchestrator:
    """Agent orchestrator: coordinates the model, tools, and session state"""

    def __init__(self, llm_client, tools, session_store):
        self.llm = llm_client
        self.tools = {tool["function"]["name"]: tool for tool in tools}
        self.session_store = session_store
        self.max_iterations = 10  # prevent infinite loops

    async def execute(self, user_input: str, session_id: str) -> str:
        """execute a user request"""
        # 1. load session context
        context = await self.load_context(session_id)

        # 2. build message history
        messages = context + [{"role": "user", "content": user_input}]

        # 3. execution loop
        iteration = 0
        while iteration < self.max_iterations:
            iteration += 1

            # call the LLM
            response = await self.call_llm(messages)

            # check whether the task is complete
            if not response.tool_calls:
                # no tool calls; return the result directly
                await self.save_context(session_id, messages + [response])
                return response.content

            # execute tool calls
            tool_results = await self.execute_tools(response.tool_calls)

            # append the result to message history
            messages.extend([response] + tool_results)

            # check whether human intervention is needed
            if await self.needs_human_handoff(messages):
                return await self.request_human_handoff(session_id, messages)

        # maximum iteration count exceeded
        return "The task is taking too long. Please simplify your request or try again later."

    async def call_llm(self, messages: List[Dict]) -> Any:
        """call the LLM"""
        return self.llm.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=list(self.tools.values()),
            tool_choice="auto"
        ).choices[0].message

    async def execute_tools(self, tool_calls: List[Any]) -> List[Dict]:
        """execute tool calls"""
        results = []

        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)

            try:
                # actual tool execution logic
                result = await self.run_tool(function_name, function_args)
                results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps(result)
                })
            except Exception as e:
                results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps({"error": str(e)})
                })

        return results

    async def run_tool(self, name: str, args: Dict) -> Any:
        """dispatch a tool call to its implementation"""
        # dispatch to the concrete implementation by tool name
        tool_implementations = {
            "search_web": self.search_web,
            "get_weather": self.get_weather,
            # ... more tools
        }

        if name in tool_implementations:
            return await tool_implementations[name](**args)
        else:
            raise ValueError(f"Unknown tool: {name}")

    async def needs_human_handoff(self, messages: List[Dict]) -> bool:
        """determine whether human handoff is needed"""
        # implement decision logic, for example:
        # - sensitive operation detected
        # - multiple retries failed
        # - the user explicitly requested human help
        return False

    async def load_context(self, session_id: str) -> List[Dict]:
        """load session context"""
        return await self.session_store.get(session_id, default=[])

    async def save_context(self, session_id: str, messages: List[Dict]):
        """save session context"""
        # truncate overly long context
        truncated = self.truncate_context(messages)
        await self.session_store.set(session_id, truncated)

    def truncate_context(self, messages: List[Dict], max_tokens: int = 8000) -> List[Dict]:
        """truncate the context to stay within token limits"""
        # simplified implementation: keep the system message and recent messages
        system_msgs = [m for m in messages if m.get("role") == "system"]
        other_msgs = [m for m in messages if m.get("role") != "system"]

        # keep recent messages
        keep_count = min(len(other_msgs), 10)  # simplified logic
        return system_msgs + other_msgs[-keep_count:]
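The truncate_context above keeps a fixed message count and ignores its max_tokens parameter. A slightly more faithful sketch budgets by approximate size, assuming roughly 4 characters per token (a crude heuristic; use a real tokenizer such as tiktoken in production):

```python
def truncate_by_token_budget(messages, max_tokens=8000, chars_per_token=4):
    """Keep system messages plus the most recent non-system messages that fit
    an approximate token budget. ~4 chars/token is a rough heuristic, not a
    real tokenizer; this is an illustrative sketch."""
    system_msgs = [m for m in messages if m.get("role") == "system"]
    other_msgs = [m for m in messages if m.get("role") != "system"]

    budget = max_tokens * chars_per_token  # budget measured in characters
    kept = []
    for msg in reversed(other_msgs):       # walk newest → oldest
        cost = len(str(msg.get("content", "")))
        if budget - cost < 0:
            break                          # adding this message would overflow
        budget -= cost
        kept.append(msg)
    return system_msgs + list(reversed(kept))
```

Walking newest-to-oldest means the most recent turns always survive truncation, which matters far more to the next model call than old history.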

4. Key Considerations in Design Selection

4.1 Domain knowledge and role setting

It is crucial to set a clear persona and domain knowledge for your Agent:

Character setting template:

You are a professional {role}, specialized in {domain}.

Your responsibilities:
1. {responsibility1}
2. {responsibility2}
3. {responsibility3}

Your behavior guidelines:
- {guideline1}
- {guideline2}
- {guideline3}

Your constraints:
- Must not {constraint1}
- Must not {constraint2}

Example - Medical Consultation Agent:

You are a medical information assistant.

Your responsibilities:
1. Understand the user's symptoms and background.
2. Provide general health information and next-step suggestions.
3. Ask clarifying questions when the user's situation is unclear.

Your behavior guidelines:
- Explain professional concepts in plain language.
- Encourage seeking professional care when risk signals appear.

Your constraints:
- you must not provide a diagnosis; advise the user to consult a qualified physician
- you must not recommend prescription medication or treatment plans
- you must not handle medical emergencies; advise the user to call emergency services

4.2 Context enhancement strategy

Short Term Memory Management:

class ShortTermMemory:
    def __init__(self, max_messages=20):
        self.messages = []
        self.max_messages = max_messages

    def add(self, message):
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        return self.messages

    def clear(self):
        self.messages = []

Long term memory implementation:

import hashlib
import time

class LongTermMemory:
    def __init__(self, vector_store):
        self.vector_store = vector_store

    async def remember(self, user_id: str, key: str, value: str, importance: float = 1.0):
        """store long-term memory"""
        memory_id = hashlib.md5(f"{user_id}:{key}".encode()).hexdigest()

        await self.vector_store.upsert(
            ids=[memory_id],
            documents=[value],
            metadatas=[{
                "user_id": user_id,
                "key": key,
                "importance": importance,
                "timestamp": time.time()
            }]
        )

    async def recall(self, user_id: str, query: str, top_k: int = 5):
        """retrieve relevant memory"""
        results = await self.vector_store.query(
            query_texts=[query],
            where={"user_id": user_id},
            n_results=top_k
        )
        return results

4.3 Multi-Agent system design pattern

The white paper mentions several common multi-agent architectures:

1. Master-Worker:

Master Agent
    ├─ Worker 1 (data processing)
    ├─ Worker 2 (analysis)
    └─ Worker 3 (generation)

Applicable scenarios: Tasks can be clearly decomposed into subtasks

2. Pipeline:

Input → Agent A → Agent B → Agent C → Output
       (collect)  (analyze)  (generate)

Applicable scenarios: fixed data-processing pipelines

3. Competitive:

Input ┬→ Agent A ─┐
      ├→ Agent B ─┼→ Selector → Best Output
      └→ Agent C ─┘

Applicable scenarios: Scenarios that require multiple options to choose the best

4. Collaborative:

Agent A ←→ Agent B
   ↕          ↕
Agent C ←→ Agent D

Applicable scenarios: Complex issues require multi-party negotiation
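The competitive pattern (3) is the easiest to sketch in code. The function name and judge interface below are my own, purely illustrative:

```python
def competitive_run(agents, task, judge):
    """Run every candidate agent on the same task and let a judge pick the
    best output. `agents` maps name → callable(task); `judge` maps an output
    to a numeric score. An illustrative sketch of the competitive pattern."""
    candidates = [(name, agent(task)) for name, agent in agents.items()]
    return max(candidates, key=lambda pair: judge(pair[1]))  # highest score wins
```

In practice the candidates would be different models, prompts, or temperatures, and the judge is often another LLM call scoring against a rubric rather than a simple function.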


5. Agent operation and maintenance: structured response to uncertainty

Unlike traditional software, Agent’s behavior is non-deterministic. The white paper proposes the concept of “Agent Ops”.

5.1 Measuring key indicators

Core indicator system:

| Indicator category | Specific indicator | Calculation | Target |
| --- | --- | --- | --- |
| Task completion | Success rate | successful tasks / total tasks | > 90% |
| Efficiency | Average steps | total steps / completed tasks | < 5 steps |
| Cost | Cost per task | token cost + API fees | < $0.1 |
| Latency | Average response time | total time / request count | < 3s |
| Quality | User satisfaction | average rating | > 4.0/5 |

Indicator monitoring code example:

from dataclasses import dataclass, field
from typing import List
import time

@dataclass
class AgentMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    total_steps: int = 0
    total_latency: float = 0.0
    total_cost: float = 0.0
    user_ratings: List[float] = field(default_factory=list)

    @property
    def success_rate(self):
        if self.total_requests == 0:
            return 0
        return self.successful_requests / self.total_requests

    @property
    def avg_steps(self):
        if self.successful_requests == 0:
            return 0
        return self.total_steps / self.successful_requests

    @property
    def avg_latency(self):
        if self.total_requests == 0:
            return 0
        return self.total_latency / self.total_requests

    @property
    def avg_cost(self):
        if self.total_requests == 0:
            return 0
        return self.total_cost / self.total_requests

    @property
    def avg_rating(self):
        if not self.user_ratings:
            return 0
        return sum(self.user_ratings) / len(self.user_ratings)

    def report(self):
        return f"""
        Agent performance report
        ========================
        Requests:      {self.total_requests}
        Success rate:  {self.success_rate:.2%}
        Avg steps:     {self.avg_steps:.1f}
        Avg latency:   {self.avg_latency:.2f}s
        Cost per task: ${self.avg_cost:.4f}
        User rating:   {self.avg_rating:.2f}/5
        """

5.2 A/B testing thinking

Think of the Agent as a system that requires continuous experimentation:

A/B Testing Framework:

import hashlib

class ABTestFramework:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, name, variants, traffic_split):
        """
        Create an experiment.
        variants: {'control': control_config, 'treatment': treatment_config}
        traffic_split: {'control': 0.5, 'treatment': 0.5}
        """
        self.experiments[name] = {
            'variants': variants,
            'traffic_split': traffic_split,
            'metrics': {'control': [], 'treatment': []}
        }

    def get_variant(self, experiment_name, user_id):
        """Deterministically assign a user to a variant"""
        exp = self.experiments[experiment_name]

        # Hash the user ID so the same user always sees the same variant.
        # hashlib (not the built-in hash(), which varies between runs)
        # keeps the bucket stable across processes.
        digest = hashlib.md5(f"{experiment_name}:{user_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100

        cumulative = 0
        for variant, split in exp['traffic_split'].items():
            cumulative += split * 100
            if hash_val < cumulative:
                return variant, exp['variants'][variant]

        return 'control', exp['variants']['control']

    def record_metric(self, experiment_name, variant, metric_value):
        """record experiment metrics"""
        self.experiments[experiment_name]['metrics'][variant].append(metric_value)

    def analyze_results(self, experiment_name):
        """analyze experiment results"""
        metrics = self.experiments[experiment_name]['metrics']

        control_metrics = metrics['control']
        treatment_metrics = metrics['treatment']

        # calculate statistical significance
        from scipy import stats
        t_stat, p_value = stats.ttest_ind(control_metrics, treatment_metrics)

        control_mean = sum(control_metrics) / len(control_metrics)
        treatment_mean = sum(treatment_metrics) / len(treatment_metrics)
        lift = (treatment_mean - control_mean) / control_mean

        return {
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'lift': lift,
            'p_value': p_value,
            'significant': p_value < 0.05
        }
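The key property of the assignment step is determinism: the same user must land in the same bucket on every request. That bucketing can be isolated into a standalone function (using `hashlib` for a hash that is stable across processes, unlike Python's built-in `hash()`):

```python
import hashlib

def assign_variant(experiment: str, user_id: str,
                   traffic_split: dict[str, float]) -> str:
    # Stable 0-99 bucket per (experiment, user) pair.
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    cumulative = 0.0
    for variant, split in traffic_split.items():
        cumulative += split * 100
        if bucket < cumulative:
            return variant
    # Fallback for floating-point rounding gaps: first variant.
    return next(iter(traffic_split))

split = {"control": 0.5, "treatment": 0.5}
# Re-running the process gives identical assignments.
assert assign_variant("new_prompt", "user_42", split) == \
       assign_variant("new_prompt", "user_42", split)
```

Deterministic bucketing also means assignments need no storage: any service can recompute a user's variant from the experiment name and user ID alone.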

5.3 LLM as judge

Use another LLM to evaluate the agent’s output quality:

import json

class LLMJudge:
    def __init__(self, judge_model="gpt-4"):
        self.judge_model = judge_model

    async def evaluate(self, task_description, agent_output, criteria):
        """Agent output"""
        prompt = f"""
        You are an impartial judge. Evaluate the quality of the following Agent output.

        Task description:
        {task_description}

        Agent output:
        {agent_output}

        Evaluation criteria:
        {criteria}

        Return the evaluation result in the following JSON format:
        {{
            "overall_score": 1-5,
            "dimension_scores": {{
                "accuracy": 1-5,
                "completeness": 1-5,
                "clarity": 1-5,
                "usefulness": 1-5
            }},
            "reasoning": "descriptionscore",
            "strengths": ["pros1", "pros2"],
            "weaknesses": ["insufficient1", "insufficient2"],
            "suggestions": ["suggestions1", "suggestions2"]
        }}
        """

        response = await self.call_llm(prompt)
        return json.loads(response)

    async def batch_evaluate(self, test_cases):
        """batch evaluation"""
        results = []
        for case in test_cases:
            result = await self.evaluate(
                case['task'],
                case['output'],
                case['criteria']
            )
            results.append(result)
        return results

5.4 Observability

Use OpenTelemetry to track the execution path of the Agent:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# configure the tracer provider and the OTLP exporter
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(endpoint="your-collector-endpoint")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

class ObservableAgent:
    @tracer.start_as_current_span("agent_execution")
    async def execute(self, user_input):
        with tracer.start_as_current_span("intent_classification") as span:
            intent = await self.classify_intent(user_input)
            span.set_attribute("intent", intent)

        with tracer.start_as_current_span("tool_execution") as span:
            tools_used = []
            for tool_call in self.plan_tools(intent):
                with tracer.start_as_current_span(f"tool_{tool_call.name}"):
                    result = await self.execute_tool(tool_call)
                    tools_used.append(tool_call.name)
            span.set_attribute("tools_used", tools_used)

        with tracer.start_as_current_span("response_generation"):
            response = await self.generate_response()

        return response

6. Agent interoperability: connecting people, agents and businesses

The white paper concludes by exploring the interoperability of the Agent ecosystem.

6.1 Agent and people

Human-machine collaboration mode:

  1. Human-in-the-loop:

    • Agent requests confirmation before execution
    • Manual review of key decision points
    • Confirmation of complex tasks in stages
  2. Human-on-the-loop:

    • Agent executes autonomously
    • Human monitoring execution process
    • Manual intervention in abnormal situations
  3. Human-in-command:

    • Humans set high-level goals
    • Agent autonomous planning and execution
    • Report progress regularly

Handover design principles:

class HumanHandoff:
    def __init__(self):
        self.handoff_triggers = [
            self.check_sensitive_operation,
            self.check_confidence_threshold,
            self.check_user_frustration,
            self.check_repeated_failures
        ]

    async def should_handoff(self, context) -> bool:
        """determine whether human handoff is needed"""
        for trigger in self.handoff_triggers:
            if await trigger(context):
                return True
        return False

    async def handoff(self, context, reason):
        """Execute"""
        handoff_context = {
            "conversation_history": context.messages,
            "pending_tasks": context.pending_tasks,
            "reason": reason,
            "suggested_action": await self.suggest_action(context)
        }

        # notify a human agent and pass along the context
        await self.notify_human_agent(handoff_context)

        return "Transferring you to a human agent, please wait..."

6.2 Agent and Agent

Cross-Agent Communication Protocol:

class AgentCommunicationProtocol:
    """Agent"""

    async def send_message(self, from_agent, to_agent, message_type, payload):
        """message"""
        message = {
            "from": from_agent,
            "to": to_agent,
            "type": message_type,  # request, response, broadcast
            "payload": payload,
            "timestamp": time.time(),
            "message_id": generate_id()
        }

        await self.message_bus.send(message)

    async def request_capability(self, agent_id, capability_requirement):
        """requestother Agent capability"""
        available_agents = await self.discover_agents(capability_requirement)

        if not available_agents:
            raise NoAgentAvailable(capability_requirement)

        # select the best-suited Agent among the candidates
        selected = self.select_best_agent(available_agents)

        # send the request and wait for the response
        response = await self.send_request(selected, capability_requirement)
        return response
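A minimal in-process transport behind `self.message_bus` could be one `asyncio.Queue` per recipient. This is a sketch for illustration only; a production system would use a real broker (Kafka, NATS, etc.):

```python
import asyncio
import time
import uuid
from collections import defaultdict

class MessageBus:
    """Hypothetical in-process message bus: one inbox queue per Agent."""
    def __init__(self):
        self.queues = defaultdict(asyncio.Queue)  # agent_id -> inbox

    async def send(self, message: dict):
        # Route by the "to" field of the envelope.
        await self.queues[message["to"]].put(message)

    async def receive(self, agent_id: str) -> dict:
        return await self.queues[agent_id].get()

async def main():
    bus = MessageBus()
    await bus.send({
        "from": "planner", "to": "executor", "type": "request",
        "payload": {"task": "fetch prices"},
        "timestamp": time.time(), "message_id": str(uuid.uuid4()),
    })
    msg = await bus.receive("executor")
    print(msg["type"], msg["payload"]["task"])  # → request fetch prices

asyncio.run(main())
```

Because every message carries `from`, `to`, `type`, and `message_id`, the same envelope works unchanged whether the bus is an in-process queue or a networked broker.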

6.3 Agent and business

Agent Economic Model:

  1. Pay-per-use:

    • Billed per API call
    • Billed based on Token consumption
    • Billing based on task completion
  2. Subscription:

    • Basic / Professional / Enterprise tiers
    • Features gated by tier
    • Usage quotas
  3. Outcome-based:

    • Billing based on business results
    • Pay only if you succeed
    • Risk sharing
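For the pay-per-use model, the cost of one task is simply token usage times unit price plus any per-call tool/API fees. A minimal sketch (the prices below are illustrative defaults, not real list prices):

```python
def task_cost(input_tokens: int, output_tokens: int, api_calls: int,
              price_in_per_1k: float = 0.001,     # illustrative prices
              price_out_per_1k: float = 0.002,
              price_per_api_call: float = 0.0005) -> float:
    """Cost of one task: token fees plus per-call API fees."""
    token_fee = ((input_tokens / 1000) * price_in_per_1k
                 + (output_tokens / 1000) * price_out_per_1k)
    return round(token_fee + api_calls * price_per_api_call, 6)

# A task consuming 3k input + 1k output tokens and 4 tool calls:
print(task_cost(3000, 1000, 4))  # → 0.007
```

Tracking this per task is what makes the `< $0.10` cost target in the metrics table above enforceable rather than a guess.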

7. Practical Enlightenment

7.1 Getting started suggestions

Progressive evolution path:

Phase 1 (1-2 weeks): Level 1 Basic Agent

  • Implement single Agent + 3-5 basic tools
  • Complete a specific scenario (such as Q&A assistant)
  • Establish basic monitoring

Phase 2 (1 month): Level 2 Planning Agent

  • Add task decomposition capability
  • Implement multi-step task execution
  • Establish an evaluation system

Phase 3 (2-3 months): Level 3 multi-agent system

  • Designing a multi-agent architecture
  • Implement collaboration between agents
  • Improve the operation and maintenance system

Phase 4 (6 months+): Level 4 self-evolving system

  • Add learning capabilities
  • Achieve self-optimization
  • Establish a closed feedback loop

7.2 Common pitfalls

Trap 1: Over-Engineering

  • Symptom: Designing complex multi-agent architectures for simple tasks
  • Solution: Start simple and evolve as needed

Trap 2: Neglecting Assessment

  • Symptom: After going online, it is found that the quality is not up to standard
  • Solution: Establish an evaluation system and data-driven iteration

Trap 3: Hardcoded prompts

  • Symptom: Prompt templates scattered throughout the code
  • Solution: Use a prompt management system

Trap 4: Lack of Security Considerations

  • Symptom: Agent performs dangerous operations
  • Solution: Establish permission control and audit mechanism

Trap 5: Ignoring user experience

  • Symptom: Agent is powerful but difficult to use
  • Solution: Continue to collect user feedback

7.3 Tool chain recommendation

| Category              | Tool                          | Use                       |
|-----------------------|-------------------------------|---------------------------|
| Development framework | LangChain, LlamaIndex         | Agent development         |
| Prompt management     | PromptLayer, Weights & Biases | Prompt version management |
| Evaluation & testing  | OpenAI Evals, Promptflow      | Quality assessment        |
| Observability         | LangSmith, Langfuse           | Monitoring and tracing    |
| Deployment            | Docker, Kubernetes            | Production deployment     |

8. Future Outlook

Agent technology is developing rapidly and we can expect:

  1. More powerful reasoning models: models such as o1 will significantly improve Agents’ planning capabilities
  2. Standardized Agent protocols: protocols such as MCP will become industry standards
  3. Native Agent support: operating systems and browsers will support Agents natively
  4. Agent marketplaces: an Agent marketplace similar to the App Store will emerge
  5. New modes of human-machine collaboration: Agents will become standard human “colleagues”

Reference and further reading

  • Original text: Kaggle Introduction to Agents Whitepaper
  • Authors: Alan Blount, Antonio Gulli, Shubham Saboo, Michael Zimmermann, Vladimir Vuskovic
  • Release date: November 2025
  • Related projects: LangChain, AutoGPT, MetaGPT

*This article is licensed under CC BY-NC-SA 4.0. Please credit the source when reprinting.*
