Article
Original interpretation: Kaggle white paper "Introduction to Agents" - AI Agent introduction and architecture panorama
In-depth analysis of the five levels, core architecture and production practices of Agent, and sorting out the key framework and inspiration of the Kaggle white paper "Introduction to Agents"
📋 Copyright Statement and Disclaimer
This article is the author’s original analysis article based on the Kaggle white paper “Introduction to Agents”.
Opinion Attribution Statement:
- The analysis of the five levels of Agent, architecture evolution and engineering practice in this article is the author’s independent organization and reinterpretation based on the white paper
- The writing structure, case organization and key judgments reflect the author’s personal understanding
- The article is not a paragraph-by-paragraph translation of the white paper, but an original interpretation oriented to engineering practice.
Original reference:
- Title: “Introduction to Agents”
- Authors: Alan Blount, Antonio Gulli, Shubham Saboo, Michael Zimmermann, Vladimir Vuskovic
- Link: Read original text
Original nature: This article is an independently created interpretation article, not a translation or rewriting. The views expressed in this article represent only the author’s personal understanding and may differ from the original author’s position.
Introduction: From a model that can answer questions to an Agent that can complete tasks
In the past few years, the most amazing capabilities of large models have been mainly reflected in “answering questions” and “generating content”: they can write code, make summaries, generate copywriting, and can also give almost expert-level explanations in many scenarios. But if we switch the perspective from “generating an answer” to “accomplishing a goal”, we will immediately discover the boundaries of traditional LLM - it can provide suggestions, but it may not be able to actively plan, call tools, decompose tasks and execute them all the way to the results.
The real value of the Kaggle white paper “Introduction to Agents” is that it explains this difference clearly: **an Agent is not simply an LLM wired into a workflow, but a complete application system that can continuously make decisions and take actions around a goal.**
This article does not retrace the white paper paragraph by paragraph. Instead, it reorganizes the most noteworthy parts from an engineering perspective: Why is the Agent a paradigm shift? How should its five levels be understood? How does architectural complexity grow step by step, from basic reasoning and tool connection to multi-step planning and system collaboration? And why should these concepts directly shape the way you build AI products?
1. From predictive AI to autonomous agents: a paradigm shift
In the past few years, the development focus of AI has always been around the passive, discrete task model - translating a paragraph, answering a question, and generating a picture based on prompts. Although this paradigm is powerful, it requires explicit instructions from humans at every step.
We are now witnessing a paradigm shift from AI that can only predict or create content to new types of software that can solve problems and perform tasks autonomously.
This is the core idea of AI Agent.
1.1 Paradigm comparison: from passive to active
Let’s understand this difference through a specific scenario:
Traditional AI (predictive):
User: What will the weather be in Beijing tomorrow?
AI: Beijing will be cloudy tomorrow, with temperatures from 15 to 22 degrees.
(conversation ends; wait for the next instruction)
Agent (autonomous):
User: I want to visit Beijing this weekend. Any suggestions?
Agent:
1. check Beijing weekend weather forecast
2. search popular Beijing weekend attractions
3. recommend outdoor activities based on the weather
4. check traffic information and recommend transport options
5. combine the information into a complete itinerary
Here is a complete recommendation: because Saturday is sunny, visit the Great Wall in the morning and the Summer Palace in the afternoon...
Agents don’t just answer questions, but actively plan and execute a series of actions to achieve goals.
1.2 What exactly is Agent?
Many people mistakenly believe that an Agent is just an LLM embedded in a workflow. But in the 54-page white paper, the authors provide a more precise definition:
The Agent is not an AI model in a static workflow, but a complete application system - one that can make plans and take actions to achieve goals.
Agents combine the reasoning capabilities of language models with practical action capabilities, enabling them to handle complex, multi-step tasks that cannot be accomplished by the model itself.
The key capability: an Agent can work autonomously, figuring out the subsequent steps required to achieve the goal without human guidance at every step.
1.3 Why is Agent important now?
There are three key factors driving the rise of Agent:
- Leap in model capabilities: Large models such as GPT-4 and Claude have powerful reasoning and planning capabilities
- Mature tool ecosystem: Protocols such as Function Calling and MCP standardize the connection between models and the external world.
- Actual demand driven: Enterprises need AI systems that can automate complex business processes
2. Five levels of Agent: from reasoning to self-evolution
The white paper proposes a valuable taxonomy that divides Agent systems into five levels. This classification not only helps us understand the evolution path of Agents, but also provides a clear reference framework for architecture design.
Level 0: Core Reasoning System
This is the base layer and contains only the language model itself. It can reason and answer questions, but has no ability to interact with the outside world.
Capability Boundary:
- Can answer knowledge questions
- Able to reason logically
- Can generate text and code
- Unable to obtain real-time information
- Unable to perform external operation
Typical Application:
- Internal knowledge Q&A (based on training data)
- Text generation and polishing
- Simple logical reasoning tasks
Architecture Example:
user input → LLM → model output
Code Example:
```python
from openai import OpenAI

client = OpenAI()

class Level0Agent:
    def __init__(self):
        self.system_prompt = "You are a helpful assistant."

    def chat(self, user_message):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message}
            ]
        )
        return response.choices[0].message.content
```
Level 1: Connected Problem-Solver
Based on Level 0, tool calling capability is added. Agents can search web pages, query databases, and call APIs to obtain information or perform actions.
Core Improvements:
- Ability to access real-time information
- Can call external API
- Have basic data acquisition capabilities
Typical Application:
- Q&A assistant with search function
- Weather query robot
- Stock information query
Architecture Example:
```
user input → LLM → need tool? ── yes ──→ call tool → merge results → output
                      │
                      no
                      ↓
                direct output
```
Code Example:
```python
import json
from openai import OpenAI

client = OpenAI()

class Level1Agent:
    def __init__(self):
        self.tools = [
            {
                "type": "function",
                "function": {
                    "name": "search_web",
                    "description": "search the web for real-time information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string"}
                        },
                        "required": ["query"]
                    }
                }
            },
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "get weather for the specified city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string"},
                            "date": {"type": "string"}
                        },
                        "required": ["city"]
                    }
                }
            }
        ]

    def execute_tool(self, tool_name, params):
        if tool_name == "get_weather":
            # real implementation would call a weather API
            return {"temperature": 22, "condition": "sunny"}
        return {}

    def chat(self, user_message):
        messages = [{"role": "user", "content": user_message}]
        # first call: let the model decide whether it needs a tool
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=self.tools,
            tool_choice="auto"
        )
        message = response.choices[0].message
        # check whether tool calls are needed
        if message.tool_calls:
            # execute tool calls
            tool_results = []
            for tool_call in message.tool_calls:
                function_name = tool_call.function.name
                function_params = json.loads(tool_call.function.arguments)
                result = self.execute_tool(function_name, function_params)
                tool_results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps(result)
                })
            # append tool results to the conversation
            messages.append(message)
            messages.extend(tool_results)
            # second call: get the final response
            final_response = client.chat.completions.create(
                model="gpt-4",
                messages=messages
            )
            return final_response.choices[0].message.content
        return message.content
```
Level 2: Strategic Problem-Solver
This layer introduces planning capabilities and multi-step reasoning. Agent can decompose complex tasks into subtasks and execute them in strategic order.
Core Improvements:
- Task Decomposition
- Multi-step planning
- Error recovery mechanisms
- Processing and integration of intermediate results
Typical Application:
- Research report generation (search → analysis → writing)
- Data analysis tasks (obtaining data → cleaning → analysis → visualization)
- Complex problem diagnosis
Architecture Example:
```
user goal → task decomposer ──→ subtask 1 → execute → result
                             ├─→ subtask 2 → execute → result
                             └─→ subtask 3 → execute → result
                                                ↓
                                  result combiner → final output
```
Code Example:
```python
class Level2Agent:
    def __init__(self):
        self.level1_agent = Level1Agent()

    def decompose_task(self, goal):
        """decompose a complex task into subtasks"""
        # note: literal braces in the JSON example must be doubled inside an f-string
        decomposition_prompt = f"""
decompose the following goal into executable steps:
Goal: {goal}
Requirements:
1. each step should be executable
2. dependencies between steps must be explicit
3. output format is a JSON array
Example output:
[
    {{"step": 1, "action": "search related material", "depends_on": []}},
    {{"step": 2, "action": "analyze collected information", "depends_on": [1]}},
    {{"step": 3, "action": "generate a report", "depends_on": [2]}}
]
"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": decomposition_prompt}]
        )
        steps = json.loads(response.choices[0].message.content)
        return steps

    def execute_task(self, goal):
        """execute a complex task"""
        # 1. task decomposition
        steps = self.decompose_task(goal)
        # 2. execute in dependency order
        results = {}
        for step in steps:
            step_num = step["step"]
            action = step["action"]
            # build context, including previous step results
            context = f"Task: {action}\n"
            if step.get("depends_on"):
                for dep in step["depends_on"]:
                    context += f"step {dep} result: {results.get(dep, '')}\n"
            # execute step
            result = self.level1_agent.chat(context)
            results[step_num] = result
            print(f"step {step_num} complete: {action}")
        # 3. merge results
        final_result = self.synthesize_results(goal, results)
        return final_result

    def synthesize_results(self, goal, results):
        """merge results from all steps"""
        synthesis_prompt = f"""
Goal: {goal}
Step execution results:
{json.dumps(results, ensure_ascii=False, indent=2)}
Merge the results above into the final output.
"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": synthesis_prompt}]
        )
        return response.choices[0].message.content
```
Level 3: Collaborative Multi-Agent System
Multiple Agents work together, and each Agent may be specialized in a specific field. They can work together, communicate with each other, and coordinate complex parallel tasks.
Core Improvements:
- Specialized division of labor
- Communication between agents
- Parallel execution
- Results coordination and integration
Analogy: Just like a software development team, there are product managers, back-end engineers, front-end engineers, and testers, each responsible for different aspects.
Typical Application:
- Software development assistant (requirements analysis → design → coding → testing)
- Investment research team (market analysis → financial analysis → risk assessment → investment advice)
- Content creation team (planning → writing → editing → design)
Architecture Example:
```
                                  ┌─→ Research Agent → data
user request → Coordinator Agent ─┼─→ Analysis Agent → insight
                                  ├─→ Writing Agent  → content
                                  └─→ Review Agent   → quality check
                                              ↓
                                      integrated output
```
Code Example:
```python
class SpecializedAgent:
    """specialized Agent base class"""
    def __init__(self, name, expertise):
        self.name = name
        self.expertise = expertise

    def process(self, task, context=None):
        prompt = f"""
You are {self.name}, specialized in {self.expertise}.
Task: {task}
{f"Context: {context}" if context else ""}
Complete this task based on your area of expertise.
"""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return {
            "agent": self.name,
            "result": response.choices[0].message.content
        }

class Level3MultiAgent:
    def __init__(self):
        self.coordinator = SpecializedAgent("Coordinator", "task decomposition and coordination")
        self.researcher = SpecializedAgent("Researcher", "information collection and research")
        self.analyst = SpecializedAgent("Analyst", "data analysis and insight")
        self.writer = SpecializedAgent("Writer", "content creation and writing")

    def execute(self, goal):
        """multi-agent collaborative execution"""
        # 1. the coordinator decomposes and assigns the task
        coordination_prompt = f"""
Goal: {goal}
Available agents:
- Researcher: responsible for information collection
- Analyst: responsible for data analysis
- Writer: responsible for content creation
Create a collaboration plan with each agent's task, input, and output.
"""
        plan_response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": coordination_prompt}]
        )
        plan = plan_response.choices[0].message.content
        # 2. sequential execution here for simplicity; independent steps could run in parallel
        research_result = self.researcher.process(f"research topic: {goal}")
        # the analyst starts after the researcher finishes
        analysis_result = self.analyst.process(
            f"analyze the following research result: {research_result['result']}",
            context=research_result['result']
        )
        # the writer integrates the final output
        final_result = self.writer.process(
            f"write the final report based on this analysis: {analysis_result['result']}",
            context=f"research: {research_result['result']}\nanalysis: {analysis_result['result']}"
        )
        return {
            "plan": plan,
            "research": research_result,
            "analysis": analysis_result,
            "final": final_result
        }
```
Level 4: Self-Evolving System
At the highest level, agents can learn from experience, self-optimize strategies, improve their own tool usage, and even modify their own system configurations.
Core Improvements:
- Learning from experience
- Self-optimizing strategies
- Improved tool usage
- Adaptive system configuration
This represents the ultimate vision of Agent technology: a truly autonomous system capable of continuous improvement.
Typical Application:
- Self-optimizing customer service system
- Adaptive Trading Strategies
- Continuous learning research assistant
Core Mechanism:
```
execute → observe result → analyze performance → identify improvements → adjust strategy
   ↑                                                                          │
   └──────────────────────────── feedback loop ←──────────────────────────────┘
```
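The white paper stops at the mechanism level here and gives no code for Level 4. As a minimal, hypothetical sketch (the class and function names below are my own, not from the white paper), the loop can be expressed as an agent that keeps a history of (strategy, score) pairs and falls back to the best-performing strategy whenever observed quality drops:

```python
# Hypothetical sketch of a self-evolving feedback loop. The "strategy"
# here is just a prompt string that gets revised when quality drops;
# a real system would evolve tool choices and configuration as well.

class SelfEvolvingAgent:
    def __init__(self, execute_fn, evaluate_fn):
        self.execute_fn = execute_fn    # runs a task with the current strategy
        self.evaluate_fn = evaluate_fn  # scores the result, e.g. via LLM-as-judge
        self.strategy = "baseline prompt"
        self.history = []               # (strategy, score) pairs observed so far

    def run(self, task, threshold=0.8):
        # execute → observe result
        result = self.execute_fn(task, self.strategy)
        # analyze performance
        score = self.evaluate_fn(task, result)
        self.history.append((self.strategy, score))
        # identify improvements → adjust strategy
        if score < threshold:
            best_strategy, _ = max(self.history, key=lambda h: h[1])
            self.strategy = best_strategy
        return result, score
```

In practice the `execute_fn`/`evaluate_fn` slots would wrap LLM calls; stubbing them with plain functions is enough to exercise the loop itself.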
3. Core architecture analysis: models, tools and orchestration layers
The white paper details the three core components of Agent. Understanding the relationship between these three is the key to designing the Agent system.
3.1 Model: Agent’s “brain”
The model is responsible for the Agent’s “thinking” process:
Core Responsibilities:
- Reasoning: Understand the problem, analyze the situation, and make logical deductions
- Planning: Develop action steps and determine the order of execution
- Decision: Choosing which tools to use and when to stop
- Reflection: Evaluate execution results, identify errors and correct them
Model selection matrix:
| Scenario | Recommended model | Reason |
|---|---|---|
| complex reasoning tasks | GPT-4, Claude-3 Opus | Powerful multi-step reasoning capabilities |
| code generation | Claude-3.5 Sonnet, GPT-4 | Strong code understanding and generation skills |
| cost sensitive scenarios | GPT-3.5, Claude-3 Haiku | High cost performance |
| Long context handling | Claude-3 (200K), GPT-4 Turbo (128K) | large context window |
| real-time interaction | GPT-4 Turbo | low latency |
Prompt Engineering Principles:
- Clear role definition: Clarify the identity and responsibility boundaries of the Agent
- Context Sufficient: Provide sufficient background information and constraints
- Output format specification: Use structured output to facilitate subsequent processing
- Example guidance: Provide input and output examples to help model understanding
3.2 Tool: Agent’s “Hand”
Tools enable the Agent to interact with the world and are an extension of the Agent’s capabilities.
Information acquisition tools (Grounding):
```python
# search engine
search_tool = {
    "name": "web_search",
    "description": "search the internet for real-time information",
    "parameters": {
        "query": "search keywords",
        "num_results": "number of results to return"
    }
}

# database query
database_tool = {
    "name": "query_database",
    "description": "query the internal database",
    "parameters": {
        "sql": "SQL query statement",
        "database": "target database"
    }
}

# RAG retrieval
rag_tool = {
    "name": "retrieve_knowledge",
    "description": "retrieve relevant information from the knowledge base",
    "parameters": {
        "query": "retrieval query",
        "top_k": "number of results to return"
    }
}
```
Operation execution tools (Action):
```python
# send email
email_tool = {
    "name": "send_email",
    "description": "send an email",
    "parameters": {
        "to": "recipient",
        "subject": "subject",
        "body": "content"
    }
}

# create calendar event
calendar_tool = {
    "name": "create_event",
    "description": "create a calendar event",
    "parameters": {
        "title": "event title",
        "start_time": "start time",
        "end_time": "end time"
    }
}

# generic API call
generic_api_tool = {
    "name": "call_api",
    "description": "call an external HTTP API",
    "parameters": {
        "url": "API endpoint URL",
        "method": "HTTP method",
        "headers": "request headers",
        "body": "request body"
    }
}
```
Tool Design Best Practices:
- Atomicity: each tool does one thing and does it well
- Idempotence: the same input produces the same output, making retries safe
- Self-description: the tool description should clearly express its purpose and parameters
- Error handling: clearly define error types and handling methods
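As a small illustration of the atomicity and error-handling points (a sketch; `run_tool` and `get_weather` are hypothetical helpers, not APIs from the white paper), a thin wrapper can turn exceptions into structured, retry-safe results:

```python
def run_tool(tool_fn, params, max_retries=2):
    """Run a tool and surface errors as data rather than exceptions,
    so the agent loop can decide how to react. Retrying is safe when
    the underlying tool is idempotent (same input -> same output)."""
    last_error = "unknown error"
    for _ in range(max_retries + 1):
        try:
            return {"ok": True, "result": tool_fn(**params)}
        except Exception as e:
            last_error = str(e)
    return {"ok": False, "error": last_error}

def get_weather(city, date=None):
    """An atomic tool: it does exactly one thing (look up weather).
    A real implementation would call a weather API."""
    return {"city": city, "condition": "sunny"}
```

`run_tool(get_weather, {"city": "Beijing"})` then returns a structured dict that an orchestrator can serialize straight into a tool message.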
3.3 Orchestration Layer: Agent’s “nervous system”
The orchestration layer is the core of the Agent system and is responsible for coordinating the interaction of models and tools.
Core Responsibilities:
- Conversation state management: Maintain the context of multiple rounds of dialogue
- Tool Call Orchestration: Handles the input/output transformation of tool calls
- Context Window Management: Ensures that the context limits of the model are not exceeded
- Error Handling and Retry: Handle various error situations gracefully
- Human-machine collaborative handoff: Seek human confirmation when needed
Orchestration layer architecture:
```
┌─────────────────────────────────────────────────────────┐
│                   Orchestration Layer                   │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Context    │  │     Tool     │  │    Error     │   │
│  │   Manager    │  │   Executor   │  │   Handler    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
├─────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Session    │  │   Planning   │  │    Human     │   │
│  │    Store     │  │    Engine    │  │   Handoff    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```
Code Example - Complete Orchestrator:
```python
import asyncio
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class Task:
    id: str
    description: str
    status: TaskStatus
    result: Optional[Any] = None
    error: Optional[str] = None

class Orchestrator:
    """Agent orchestrator: coordinates the model, tools and session state"""
    def __init__(self, llm_client, tools, session_store):
        self.llm = llm_client
        self.tools = {tool["function"]["name"]: tool for tool in tools}
        self.session_store = session_store
        self.max_iterations = 10  # prevent infinite loops

    async def execute(self, user_input: str, session_id: str) -> str:
        """execute a user request"""
        # 1. load session context
        context = await self.load_context(session_id)
        # 2. build message history
        messages = context + [{"role": "user", "content": user_input}]
        # 3. execution loop
        iteration = 0
        while iteration < self.max_iterations:
            iteration += 1
            # call the LLM
            response = await self.call_llm(messages)
            # check whether the task is complete
            if not response.tool_calls:
                # no tool calls; return the result directly
                await self.save_context(session_id, messages + [response])
                return response.content
            # execute tool calls
            tool_results = await self.execute_tools(response.tool_calls)
            # append the results to message history
            messages.extend([response] + tool_results)
            # check whether human intervention is needed
            if await self.needs_human_handoff(messages):
                return await self.request_human_handoff(session_id, messages)
        # maximum iteration count exceeded
        return "The task is taking too long. Please simplify your request or try again later."

    async def call_llm(self, messages: List[Dict]) -> Any:
        """call the LLM"""
        return self.llm.chat.completions.create(
            model="gpt-4",
            messages=messages,
            tools=list(self.tools.values()),
            tool_choice="auto"
        ).choices[0].message

    async def execute_tools(self, tool_calls: List[Any]) -> List[Dict]:
        """execute tool calls"""
        results = []
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_args = json.loads(tool_call.function.arguments)
            try:
                # actual tool execution logic
                result = await self.run_tool(function_name, function_args)
                results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps(result)
                })
            except Exception as e:
                results.append({
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": json.dumps({"error": str(e)})
                })
        return results

    async def run_tool(self, name: str, args: Dict) -> Any:
        """dispatch a tool call to its concrete implementation"""
        tool_implementations = {
            "search_web": self.search_web,
            "get_weather": self.get_weather,
            # ... more tools
        }
        if name in tool_implementations:
            return await tool_implementations[name](**args)
        else:
            raise ValueError(f"Unknown tool: {name}")

    async def needs_human_handoff(self, messages: List[Dict]) -> bool:
        """determine whether human handoff is needed"""
        # implement decision logic, for example:
        # - sensitive operation detected
        # - multiple retries failed
        # - the user explicitly requested human help
        return False

    async def load_context(self, session_id: str) -> List[Dict]:
        """load session context"""
        return await self.session_store.get(session_id, default=[])

    async def save_context(self, session_id: str, messages: List[Dict]):
        """save session context"""
        # truncate overly long context
        truncated = self.truncate_context(messages)
        await self.session_store.set(session_id, truncated)

    def truncate_context(self, messages: List[Dict], max_tokens: int = 8000) -> List[Dict]:
        """truncate context to stay within the model's token limit"""
        # simplified implementation: keep system messages and the most recent messages
        system_msgs = [m for m in messages if m.get("role") == "system"]
        other_msgs = [m for m in messages if m.get("role") != "system"]
        keep_count = min(len(other_msgs), 10)  # simplified logic
        return system_msgs + other_msgs[-keep_count:]
```
4. Key Considerations in Design Selection
4.1 Domain knowledge and role setting
It is crucial to set a clear persona and domain knowledge for your Agent:
Character setting template:
You are a professional {role}, specialized in {domain}.
Your responsibilities:
1. {responsibility1}
2. {responsibility2}
3. {responsibility3}
Your behavior guidelines:
- {guideline1}
- {guideline2}
- {guideline3}
Your constraints:
- must not {constraint1}
- must not {constraint2}
Example - Medical Consultation Agent:
You are a medical information assistant.
Your responsibilities:
1. Understand the user's symptoms and background.
2. Provide general health information and next-step suggestions.
3. Ask clarifying questions when the user's situation is unclear.
Your behavior guidelines:
- Use plain language and explain professional concepts simply.
- Encourage professional care when risk signals appear.
Your constraints:
- you must not provide a diagnosis; advise the user to consult a qualified physician
- you must not recommend prescription medication or treatment plans
- you must not handle medical emergencies; advise the user to call emergency services
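In code, applying such a template is plain string formatting. Below is a sketch (the fields mirror the generic template above; the function name is illustrative) that renders a persona into a system prompt:

```python
# Illustrative helper: renders the persona template above into a
# system prompt string; field names are assumptions, not a fixed API.
PERSONA_TEMPLATE = """You are a professional {role}, specialized in {domain}.

Your responsibilities:
{responsibilities}

Your constraints:
{constraints}"""

def build_system_prompt(role, domain, responsibilities, constraints):
    """Render the persona template into a system prompt string."""
    return PERSONA_TEMPLATE.format(
        role=role,
        domain=domain,
        responsibilities="\n".join(f"{i}. {r}" for i, r in enumerate(responsibilities, 1)),
        constraints="\n".join(f"- you must not {c}" for c in constraints),
    )
```

The resulting string goes into the `system` message of every call made on behalf of that agent, so the persona and constraints apply uniformly across turns.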
4.2 Context enhancement strategy
Short Term Memory Management:
```python
class ShortTermMemory:
    def __init__(self, max_messages=20):
        self.messages = []
        self.max_messages = max_messages

    def add(self, message):
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            self.messages = self.messages[-self.max_messages:]

    def get_context(self):
        return self.messages

    def clear(self):
        self.messages = []
```
Long term memory implementation:
```python
import hashlib
import time

class LongTermMemory:
    def __init__(self, vector_store):
        self.vector_store = vector_store

    async def remember(self, user_id: str, key: str, value: str, importance: float = 1.0):
        """store a long-term memory"""
        memory_id = hashlib.md5(f"{user_id}:{key}".encode()).hexdigest()
        await self.vector_store.upsert(
            ids=[memory_id],
            documents=[value],
            metadatas=[{
                "user_id": user_id,
                "key": key,
                "importance": importance,
                "timestamp": time.time()
            }]
        )

    async def recall(self, user_id: str, query: str, top_k: int = 5):
        """retrieve relevant memories"""
        results = await self.vector_store.query(
            query_texts=[query],
            where={"user_id": user_id},
            n_results=top_k
        )
        return results
```
4.3 Multi-Agent system design pattern
The white paper mentions several common multi-agent architectures:
1. Master-Worker:
Master Agent
├─ Worker 1 (data processing)
├─ Worker 2 (analysis)
└─ Worker 3 (generate)
Applicable scenarios: Tasks can be clearly decomposed into subtasks
2. Pipeline:
Input → Agent A → Agent B → Agent C → Output
          ( )     (analysis) (generate)
Applicable scenarios: Scenarios with fixed data processing processes
3. Competitive:
Input ┬→ Agent A ─┐
├→ Agent B ─┼→ Selector → Best Output
└→ Agent C ─┘
Applicable scenarios: Scenarios that require multiple options to choose the best
4. Collaborative:
Agent A ←→ Agent B
   ↕          ↕
Agent C ←→ Agent D
Applicable scenarios: Complex issues require multi-party negotiation
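Of the four patterns, the pipeline is the easiest to sketch: each agent consumes the previous agent's output. A toy example, with stub functions standing in for LLM-backed agents (all names are illustrative):

```python
def run_pipeline(agents, initial_input):
    """Pipeline pattern: feed each agent's output into the next agent."""
    data = initial_input
    for agent in agents:
        data = agent(data)
    return data

# stub agents standing in for LLM-backed ones
clean = lambda text: text.strip().lower()                          # normalization stage
analyze = lambda text: {"text": text, "words": len(text.split())}  # analysis stage
report = lambda stats: f"{stats['words']} words: {stats['text']}"  # generation stage
```

`run_pipeline([clean, analyze, report], " Hello World ")` walks the three stages in order; swapping a stage only requires replacing one element of the list, which is why this pattern suits fixed data-processing flows.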
5. Agent operation and maintenance: structured response to uncertainty
Unlike traditional software, Agent’s behavior is non-deterministic. The white paper proposes the concept of “Agent Ops”.
5.1 Measuring key indicators
Core indicator system:
| Indicator category | Specific indicator | Calculation method | Target value |
|---|---|---|---|
| Task completion | success rate | successful tasks / total tasks | > 90% |
| Efficiency | average number of steps | total steps / completed tasks | < 5 steps |
| Cost | cost per task | token cost + API cost | < $0.1 |
| Latency | average response time | total time / number of requests | < 3s |
| Quality | user satisfaction | average rating | > 4.0/5 |
Indicator monitoring code example:
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    total_steps: int = 0
    total_latency: float = 0
    total_cost: float = 0
    user_ratings: List[float] = field(default_factory=list)

    @property
    def success_rate(self):
        if self.total_requests == 0:
            return 0
        return self.successful_requests / self.total_requests

    @property
    def avg_steps(self):
        if self.successful_requests == 0:
            return 0
        return self.total_steps / self.successful_requests

    @property
    def avg_latency(self):
        if self.total_requests == 0:
            return 0
        return self.total_latency / self.total_requests

    @property
    def avg_cost(self):
        if self.total_requests == 0:
            return 0
        return self.total_cost / self.total_requests

    @property
    def avg_rating(self):
        if not self.user_ratings:
            return 0
        return sum(self.user_ratings) / len(self.user_ratings)

    def report(self):
        return f"""
Agent performance report
========================
Requests:     {self.total_requests}
Success rate: {self.success_rate:.2%}
Avg steps:    {self.avg_steps:.1f}
Avg latency:  {self.avg_latency:.2f}s
Avg cost:     ${self.avg_cost:.4f}
User rating:  {self.avg_rating:.2f}/5
"""
```
5.2 A/B testing thinking
Think of the Agent as a system that requires continuous experimentation:
A/B Testing Framework:
```python
class ABTestFramework:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, name, variants, traffic_split):
        """
        create an experiment
        variants: {'control': control_config, 'treatment': treatment_config}
        traffic_split: {'control': 0.5, 'treatment': 0.5}
        """
        self.experiments[name] = {
            'variants': variants,
            'traffic_split': traffic_split,
            'metrics': {'control': [], 'treatment': []}
        }

    def get_variant(self, experiment_name, user_id):
        """assign a variant deterministically by user ID"""
        exp = self.experiments[experiment_name]
        # use the user ID hash to keep assignment consistent
        hash_val = hash(f"{experiment_name}:{user_id}") % 100
        cumulative = 0
        for variant, split in exp['traffic_split'].items():
            cumulative += split * 100
            if hash_val < cumulative:
                return variant, exp['variants'][variant]
        return 'control', exp['variants']['control']

    def record_metric(self, experiment_name, variant, metric_value):
        """record experiment metrics"""
        self.experiments[experiment_name]['metrics'][variant].append(metric_value)

    def analyze_results(self, experiment_name):
        """analyze experiment results"""
        metrics = self.experiments[experiment_name]['metrics']
        control_metrics = metrics['control']
        treatment_metrics = metrics['treatment']
        # test for statistical significance
        from scipy import stats
        t_stat, p_value = stats.ttest_ind(control_metrics, treatment_metrics)
        control_mean = sum(control_metrics) / len(control_metrics)
        treatment_mean = sum(treatment_metrics) / len(treatment_metrics)
        lift = (treatment_mean - control_mean) / control_mean
        return {
            'control_mean': control_mean,
            'treatment_mean': treatment_mean,
            'lift': lift,
            'p_value': p_value,
            'significant': p_value < 0.05
        }
```
5.3 LLM as judge
Use another LLM to evaluate the agent’s output quality:
```python
class LLMJudge:
    def __init__(self, judge_model="gpt-4"):
        self.judge_model = judge_model

    async def evaluate(self, task_description, agent_output, criteria):
        """evaluate an Agent output against the given criteria"""
        prompt = f"""
You are an impartial judge. Evaluate the quality of the following Agent output.

Task description:
{task_description}

Agent output:
{agent_output}

Evaluation criteria:
{criteria}

Return the evaluation result in the following JSON format:
{{
    "overall_score": 1-5,
    "dimension_scores": {{
        "accuracy": 1-5,
        "completeness": 1-5,
        "clarity": 1-5,
        "usefulness": 1-5
    }},
    "reasoning": "explanation of the scores",
    "strengths": ["strength 1", "strength 2"],
    "weaknesses": ["weakness 1", "weakness 2"],
    "suggestions": ["suggestion 1", "suggestion 2"]
}}
"""
        response = await self.call_llm(prompt)  # call_llm: wrapper around the judge model
        return json.loads(response)

    async def batch_evaluate(self, test_cases):
        """batch evaluation"""
        results = []
        for case in test_cases:
            result = await self.evaluate(
                case['task'],
                case['output'],
                case['criteria']
            )
            results.append(result)
        return results
```
5.4 Observability
Use OpenTelemetry to track the execution path of the Agent:
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# configure the tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
otlp_exporter = OTLPSpanExporter(endpoint="your-collector-endpoint")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

class ObservableAgent:
    @tracer.start_as_current_span("agent_execution")
    async def execute(self, user_input):
        with tracer.start_as_current_span("intent_classification") as span:
            intent = await self.classify_intent(user_input)
            span.set_attribute("intent", intent)
        with tracer.start_as_current_span("tool_execution") as span:
            tools_used = []
            for tool_call in self.plan_tools(intent):
                with tracer.start_as_current_span(f"tool_{tool_call.name}"):
                    result = await self.execute_tool(tool_call)
                    tools_used.append(tool_call.name)
            span.set_attribute("tools_used", tools_used)
        with tracer.start_as_current_span("response_generation"):
            response = await self.generate_response()
        return response
```
6. Agent interoperability: connecting people, agents and businesses
The white paper concludes by exploring the interoperability of the Agent ecosystem.
6.1 Agent and people
Human-machine collaboration mode:
- Human-in-the-loop:
  - The Agent requests confirmation before executing
  - Key decision points are reviewed manually
  - Complex tasks are confirmed in stages
- Human-on-the-loop:
  - The Agent executes autonomously
  - Humans monitor the execution process
  - Humans intervene in abnormal situations
- Human-in-command:
  - Humans set high-level goals
  - The Agent plans and executes autonomously
  - Progress is reported regularly
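The key difference between the three modes is where human approval sits in the control flow. A minimal sketch of the first mode, human-in-the-loop, where sensitive actions are gated behind an explicit confirmation callback (the `Action` type and `run_with_confirmation` function are illustrative, not from the white paper):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    sensitive: bool  # True means a human must approve before execution

def run_with_confirmation(action: Action, confirm: Callable[[Action], bool]) -> str:
    """Human-in-the-loop gate: sensitive actions require explicit approval;
    non-sensitive actions run straight through."""
    if action.sensitive and not confirm(action):
        return f"{action.name}: rejected by human reviewer"
    return f"{action.name}: executed"
```

In human-on-the-loop mode the `confirm` callback would disappear and be replaced by after-the-fact monitoring; the gate moves from before execution to alongside it.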
Handover design principles:
class HumanHandoff:
    def __init__(self):
        self.handoff_triggers = [
            self.check_sensitive_operation,
            self.check_confidence_threshold,
            self.check_user_frustration,
            self.check_repeated_failures
        ]

    async def should_handoff(self, context) -> bool:
        """Determine whether a human handoff is needed."""
        for trigger in self.handoff_triggers:
            if await trigger(context):
                return True
        return False

    async def handoff(self, context, reason):
        """Execute the handoff to a human agent."""
        handoff_context = {
            "conversation_history": context.messages,
            "pending_tasks": context.pending_tasks,
            "reason": reason,
            "suggested_action": await self.suggest_action(context)
        }
        # Notify a human agent with the full context
        await self.notify_human_agent(handoff_context)
        return "Transferred to a human agent; please wait..."
6.2 Agent and Agent
Cross-Agent Communication Protocol:
class AgentCommunicationProtocol:
    """Protocol for communication between Agents"""

    async def send_message(self, from_agent, to_agent, message_type, payload):
        """Send a message to another Agent"""
        message = {
            "from": from_agent,
            "to": to_agent,
            "type": message_type,  # request, response, broadcast
            "payload": payload,
            "timestamp": time.time(),
            "message_id": generate_id()
        }
        await self.message_bus.send(message)

    async def request_capability(self, agent_id, capability_requirement):
        """Request a capability from another Agent"""
        available_agents = await self.discover_agents(capability_requirement)
        if not available_agents:
            raise NoAgentAvailable(capability_requirement)
        # Select the most suitable Agent
        selected = self.select_best_agent(available_agents)
        # Send the request and await the response
        response = await self.send_request(selected, capability_requirement)
        return response
6.3 Agent and business
Agent Economic Model:
- Pay-per-use:
  - Billed per API call
  - Billed by Token consumption
  - Billed by completed task
- Subscription:
  - Basic / Professional / Enterprise tiers
  - Features gated by tier
  - Usage quotas
- Outcome-based:
  - Billed by business outcome
  - Pay only on success
  - Risk sharing
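Of these, billing by Token consumption is the most mechanical to implement. A minimal sketch, assuming the common convention of separate per-1,000-token prices for prompt and completion tokens (function name and parameters are illustrative):

```python
def usage_cost(prompt_tokens: int, completion_tokens: int,
               prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Pay-per-use billing by Token consumption: each direction is priced
    per 1,000 tokens and the two charges are summed."""
    return (prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k)
```

For example, 2,000 prompt tokens at $0.50/1k plus 1,000 completion tokens at $1.50/1k comes to $2.50. Outcome-based billing, by contrast, requires a verifiable definition of "success" per task, which is why it remains the rarest of the three models.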
7. Practical Takeaways
7.1 Getting started suggestions
Progressive evolution path:
Phase 1 (1-2 weeks): Level 1 Basic Agent
- Implement single Agent + 3-5 basic tools
- Complete a specific scenario (such as Q&A assistant)
- Establish basic monitoring
Phase 2 (1 month): Level 2 Planning Agent
- Add task decomposition capability
- Implement multi-step task execution
- Establish an evaluation system
Phase 3 (2-3 months): Level 3 multi-agent system
- Designing a multi-agent architecture
- Implement collaboration between agents
- Improve the operation and maintenance system
Phase 4 (6 months+): Level 4 self-evolving system
- Add learning capabilities
- Achieve self-optimization
- Establish a closed feedback loop
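The Phase 1 deliverable, a single Agent with a handful of tools, can be surprisingly small. A deliberately tiny sketch of the pattern: a tool registry plus a dispatcher. In a real Phase 1 build the LLM would choose the tool and its arguments from the registry; here the choice is passed in directly so the sketch stays self-contained (all names are illustrative):

```python
# Tool registry: name -> callable. A real agent would also store each tool's
# description and parameter schema so the LLM can pick tools by itself.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def basic_agent(tool_name, *args):
    """Dispatch a single tool call and return the result."""
    if tool_name not in TOOLS:
        return f"unknown tool: {tool_name}"
    return TOOLS[tool_name](*args)
```

Phase 2 then wraps this dispatcher in a loop: decompose the task, pick a tool per step, feed results back into the next step.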
7.2 Common pitfalls
Trap 1: Over-Engineering
- Symptom: Designing complex multi-agent architectures for simple tasks
- Solution: Start simple and evolve as needed
Trap 2: Neglecting Evaluation
- Symptom: Quality problems only surface after launch
- Solution: Build an evaluation system and iterate on the data
Trap 3: Hardcoded prompts
- Symptom: Prompt strings scattered throughout the code
- Solution: Use a prompt management system
Trap 4: Lack of Security Considerations
- Symptom: Agent performs dangerous operations
- Solution: Establish permission control and audit mechanism
Trap 5: Ignoring user experience
- Symptom: Agent is powerful but difficult to use
- Solution: Continue to collect user feedback
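For Trap 3, even before adopting a dedicated tool like PromptLayer, the core idea can be approximated with a small in-code registry: templates live in one versioned table instead of being scattered as string literals through the codebase. A hypothetical sketch (registry shape and names are my own illustration):

```python
# Hypothetical prompt registry: templates keyed by (name, version) so that
# prompt changes are explicit, reviewable, and easy to A/B test.
PROMPTS = {
    ("summarize", "v1"): "Summarize the following text:\n{text}",
    ("summarize", "v2"): "Summarize the following text in 3 bullet points:\n{text}",
}

def render_prompt(name: str, version: str, **variables) -> str:
    """Look up a template by (name, version) and fill in its variables."""
    return PROMPTS[(name, version)].format(**variables)
```

Pinning call sites to an explicit version also makes the A/B testing described in section 5 straightforward: route a fraction of traffic to "v2" and compare metrics.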
7.3 Tool chain recommendation
| Category | Tools | Purpose |
|---|---|---|
| Development frameworks | LangChain, LlamaIndex | Agent development |
| Prompt management | PromptLayer, Weights & Biases | Prompt version management |
| Evaluation & testing | OpenAI Evals, Promptflow | Quality assessment |
| Observability | LangSmith, Langfuse | Monitoring and tracing |
| Deployment | Docker, Kubernetes | Production deployment |
8. Future Outlook
Agent technology is developing rapidly and we can expect:
- More powerful reasoning models: reasoning models such as o1 will significantly improve Agents' planning capabilities
- Standardized Agent Protocol: MCP and other protocols will become industry standards
- Native Agent Support: Operating systems and browsers will natively support Agent
- Agent Market: There will be an Agent Market similar to the App Store
- New model of human-machine collaboration: Agent will become a standard “colleague” of humans
Reference and further reading
- Original text: Kaggle Introduction to Agents Whitepaper
- Authors: Alan Blount, Antonio Gulli, Shubham Saboo, Michael Zimmermann, Vladimir Vuskovic
- Release date: November 2025
- Related projects: LangChain, AutoGPT, MetaGPT
*This article is licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.*