
Original interpretation: Agent quality assessment - the cornerstone of trust in the AI era

In-depth analysis of the essential challenges of Agent quality assessment and why quality engineering is the key to determining the success or failure of AI products

Published: 3/12/2026 · Category: Interpretation · Reading time: 29 min

📋 Copyright Statement and Disclaimer

This article is an original analysis based on the author’s personal hands-on experience, inspired by the Kaggle white paper “Agent Quality”.

Opinion attribution:

  • All specific cases, practical data, and pitfall experiences in this article come from the author’s own project experience
  • The core methodology and framework have been reworked through the author’s own thinking
  • The white paper’s academic formulations are referenced only for some concept definitions

Original nature: this article is an independently written practice summary, not a translation or adaptation of the white paper. The views expressed represent only the author’s personal understanding and may differ from the original authors’ position.


Introduction: The evaluation system that collapsed right after launch

It was the fall of 2024. The intelligent customer service Agent our team had built over two months had finally passed internal testing and was ready for official launch. As the quality lead, I was full of confidence: our test coverage had reached 85%, the automated evaluation scripts were running normally, and manual evaluation samples showed a satisfaction score of 4.5.

In the first week after launch, everything seemed to be going smoothly. The system ran stably, user feedback was positive, and the business side was even preparing to expand the rollout.

On Monday morning of the following week, my phone was buzzing with calls.

The customer service supervisor told me angrily: “The system is completely out of control! Starting at midnight yesterday, the Agent began recommending completely unrelated products to users, and even recommended competitors’ products to our customers!”

After an emergency investigation, we found that the problem lay in our evaluation system. Our test set covered “normal scenarios” but not “edge scenarios”. When users asked questions in certain dialects late at night, the Agent’s understanding went astray and the recommendations failed completely.

The deeper problem: **our evaluation metric was “human scoring”, but human evaluators tested under ideal conditions and did not cover the complexity of the real production environment.**

This incident cost us 23% of our active users. Worse, it destroyed the business team’s trust in the AI system. Three months later, the project was halted.

At that moment, I deeply realized: **Agent quality assessment is not a technical detail, but the core capability that determines the life or death of AI products. Without credible evaluation, there is no credible AI.**


Chapter 1: Why Agent Quality Assessment is “Mission Impossible”

Misunderstanding 1: Treating Agent quality as an “accuracy issue”

My team and I made this mistake in the early stages of AI product development. We believed that Agent quality assessment was “correct answer matching”: given input A and expected output B, take the Agent’s output C and compute the similarity between C and B.

The cost of this assumption was huge.

In a legal consulting agent project, we spent three weeks building a “standard answer library” containing 1,000 common legal questions and their “standard answers”. Our automated evaluation system computed the semantic similarity between the Agent’s answer and the standard answer, with a target similarity above 0.85.
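As a rough illustration, here is a minimal sketch of that similarity-based evaluation, assuming the sentence-transformers library; the model name, the answer-library structure, and its sample entry are illustrative, while the 0.85 threshold mirrors our target above.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal sketch of the similarity-based evaluation we later abandoned.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical excerpt of the "standard answer library".
answer_library = {
    "What should I do if I am fired by the company without reason?":
        "You may apply for labor arbitration within the statutory period...",
}

def passes(question: str, agent_answer: str, threshold: float = 0.85) -> bool:
    reference = answer_library[question]
    emb = model.encode([agent_answer, reference], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    # Pass/fail on one similarity score -- this silently assumes every
    # question has exactly one "standard answer", the trap described below.
    return similarity > threshold
```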

The test results looked great: 78% of the questions met the target similarity. Yet after launch, user satisfaction was only 3.2 points.

Digging deeper, we discovered the problem: **the “right answer” to a legal question is often not unique.** For the question “What should I do if I am fired by the company without reason?”, the standard answer focused on the labor arbitration process, but what users actually needed spanned several dimensions: emotional reassurance, action steps, and evidence preparation. Our “standard answer” was too narrow, and the Agent’s “deviation” was actually serving users’ real needs.

**Key insight: Agent quality is not the single dimension of “accuracy”, but the multi-dimensional “degree to which user needs are met”.**

Misunderstanding 2: Underestimating the challenge of non-determinism

Traditional software testing is deterministic: given input A, expect output B; if you get C, it is a bug. But Agents fundamentally change this paradigm.

Different runs of the same input may produce different but equally reasonable outputs.

Example: a user asks, “Recommend a programming introductory book”

  • Run 1: “Python Programming: From Introduction to Practice” - suitable for complete beginners
  • Run 2: “Fluent Python” - suitable for readers with basic knowledge
  • Run 3: “Head First Python” - with pictures and text, suitable for visual learners

All three answers are reasonable, but they target different readers. How do you judge which is better?

Worse still, the Agent may produce “reasonable but wrong” output. We encountered a case where the Agent recommended the “Python Data Science Handbook” to a user who wanted to learn data analysis. The book is indeed a classic, but the user later complained that “this book is too theoretical; I want a more hands-on tutorial.”

There was nothing wrong with the Agent’s recommendation; it just didn’t match the user’s implicit needs. This “reasonable but unsatisfying” situation almost never occurs in deterministic software testing, but it abounds in Agent testing.

**Key insight: the Agent’s non-determinism is not a bug, but a feature. Quality assessment must accept and manage this non-determinism rather than attempt to eliminate it.**
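One practical consequence: instead of asserting one expected output, sample several runs and score each against an acceptance predicate. A minimal sketch, assuming a hypothetical `agent.answer()` call and a deliberately loose checker:

```python
import statistics

def acceptable(answer: str) -> bool:
    # Hypothetical predicate: any answer naming a real introductory book
    # counts as reasonable, rather than demanding one fixed title.
    known_good = ["Python Programming", "Fluent Python", "Head First Python"]
    return any(title in answer for title in known_good)

def pass_rate(agent, prompt: str, runs: int = 10) -> float:
    # Sample the non-deterministic Agent several times and report the
    # fraction of acceptable outputs, instead of exact-matching one run.
    return statistics.mean(acceptable(agent.answer(prompt)) for _ in range(runs))
```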

Misunderstanding 3: Ignoring subjective quality standards

Even teams that are aware of non-determinism often still understand “quality” only at the objective level. But a large part of Agent quality is subjective.

We compared two versions of Agent in an A/B test:

  • Version A: Answer concisely and give the answer directly
  • Version B: Detailed answer, including explanation and background

Objective indicators showed that version A had faster response times and lower token consumption. But user satisfaction surveys showed version B scoring 23% higher. In-depth interviews revealed that users preferred the feeling of being “taught”, even at the cost of a longer wait.

Another case involved “tone”. Our Agent’s tone was too “calm and professional” when answering complaints. Objectively, the answers were accurate and complete. But users’ subjective impression was “indifferent” and “uncaring”. When we adjusted the tone to include expressions of empathy, satisfaction rose significantly.

**Key insight: Agent quality assessment must include subjective dimensions, and these dimensions are often better predictors of user satisfaction than objective indicators.**

Misunderstanding 4: Confusing “passed testing” and “production ready”

This is the biggest pitfall I have ever fallen into.

In our tests, the Agent performed well. We had an evaluation set of 500 test cases covering common question types. The Agent’s pass rate reached 87%, which we considered good enough to launch.

But the data distribution of the production environment was completely different from the test set.

The questions in the test set were “clean”: correct grammar, standard wording, clear intent. Real users’ questions were “messy”: full of typos, colloquial expressions, ambiguous intent, and multiple tasks mixed together.

We analyzed the top 1,000 real conversations in production and found:

  • 32% contained typos or irregular expressions
  • 28% mixed multiple questions together
  • 15% had intents that matched no test case
  • 12% contained sarcasm, emotion, or other implicit signals

Our Agent performed well on the “clean” test set, but made frequent mistakes in the “messy” real environment.

**Key insight: the key to Agent quality assessment is not “test set pass rate” but “real-environment adaptability”. The test set must match the production data distribution as closely as possible.**


Chapter 2: Five-dimensional evaluation framework of Agent quality

After more than two years of practical exploration, I have gradually formed a layered understanding of Agent quality. This is not a textbook taxonomy, but an engineering framework distilled from countless pitfalls.

The white paper breaks down Agent quality into five core dimensions. The value of this framework lies not in its comprehensiveness, but in its operability.

The first dimension: functionality - whether the agent completed the task

Functionality answers the most basic question: Can the agent complete the task correctly?

This seems simple but is fraught with complexity in practice.

Task Completion Rate Trap

In a customer service agent project, we initially used “whether an answer was given” as the criterion for task completion. By that measure, the completion rate was as high as 95%. But the business side was not satisfied: many “answers” did not solve users’ problems at all.

We later refined our definition of “completed”:

  • Surface completion: the Agent responded
  • Functional completion: the Agent performed the correct action
  • Goal completion: the user’s real need was met

Take “query order status” as an example:

  • Surface completion: the Agent replied “Your order status is…”
  • Functional completion: the Agent queried the correct order system and returned the accurate status
  • Goal completion: the user not only knows the status, but also understands why it is in that status and what to do next

Only “goal completion” is true completion.
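To make the distinction operational, it helps to record the level explicitly for every evaluated conversation. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass
from enum import Enum

class CompletionLevel(Enum):
    SURFACE = 1      # the Agent responded at all
    FUNCTIONAL = 2   # the Agent performed the correct action
    GOAL = 3         # the user's real need was met

@dataclass
class EvalRecord:
    conversation_id: str
    level: CompletionLevel

def goal_completion_rate(records: list[EvalRecord]) -> float:
    # Only goal-level conversations count as truly completed.
    done = sum(1 for r in records if r.level is CompletionLevel.GOAL)
    return done / len(records) if records else 0.0
```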

The Dilemma of Result Correctness

Even when the Agent completes the task, what counts as a correct result needs careful definition.

We encountered a case where the Agent calculated a return on investment for a user, and the calculation was completely correct. But the user later complained that “this completely misled me”, because the Agent had not considered tax effects, which were crucial to the user’s investment decision.

Technically correct, business-wise wrong. This is a common pitfall in Agent quality assessment.

Second dimension: reliability - whether the agent is stable and predictable

Reliability does not equal accuracy. An Agent that gives suboptimal but stable answers every time may, in some scenarios, be preferable to one that swings between brilliant and terrible.

The Paradox of Stability

Early on we pursued the “best answer”, letting the Agent adjust flexibly based on context. The result: the same question could get different answers at different times, leaving users confused: “Why did you say A last time and B this time?”

Later, we adjusted the strategy: for standardized questions we pursued consistent answers; for open-ended questions we allowed flexibility. This significantly increased user trust.

Robustness Challenge

Agents must be able to handle all kinds of “imperfect” input: typos, grammatical errors, colloquial expressions, mixed languages, and so on.

We once tested an Agent’s sensitivity to input perturbations:

  • Original input: “Recommend a good hotel in Beijing”
  • Perturbation 1: “Recommend a good hotle in Beijing” (typo)
  • Perturbation 2: “Recommend a good hotel in Beijing, okay?” (casual modal particle)
  • Perturbation 3: “Could you recommend a good hotel in Beijing?” (indirect phrasing)

Ideally, the Agent should give consistent or equivalent answers across all these variants. But our tests showed that under certain variants, the Agent’s understanding deviated significantly.
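A minimal sketch of such a perturbation test, assuming a hypothetical `agent.answer()` and reusing an embedding model to decide whether two answers are semantically equivalent (model name and threshold are assumptions):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantically_close(a: str, b: str, threshold: float = 0.8) -> bool:
    emb = model.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def robustness_failures(agent, original: str, perturbations: list[str]) -> list[str]:
    """Return the perturbed inputs whose answers drift away from the baseline."""
    baseline = agent.answer(original)
    return [p for p in perturbations
            if not semantically_close(agent.answer(p), baseline)]

# Usage sketch, mirroring the perturbation list above:
# failures = robustness_failures(agent, "Recommend a good hotel in Beijing",
#     ["Recommend a good hotle in Beijing",
#      "Could you recommend a good hotel in Beijing?"])
```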

The third dimension: efficiency - the cost at which the agent completes the task

The efficiency dimension is particularly important in Agent systems, because costs can easily spiral out of control.

Multiple Dimensions of Cost

The cost of an Agent is not just the direct API fee; it also includes:

  • Token cost: input and output token consumption
  • Latency cost: user waiting time
  • Step cost: the number of steps needed to complete a task
  • Retry cost: the number of retries after failures

We once ran an Agent whose average cost per query was $0.15, which doesn’t sound high. But as the user base grew, the monthly cost reached $50,000, far exceeding the budget.

Deeper analysis showed that the Agent tended to “overthink” ambiguous questions: calling the LLM multiple times, re-running query tools, and generating lengthy explanations. All of this is “correct” in terms of functionality, but “wasteful” in terms of efficiency.
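Catching this kind of overthinking requires tracking all four cost dimensions per query, not just the API bill. A minimal sketch; the field names, prices, and thresholds are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class QueryCost:
    input_tokens: int
    output_tokens: int
    latency_ms: int
    steps: int      # LLM and tool calls used to finish the task
    retries: int    # retries after failures

    def dollars(self, in_per_1k: float = 0.0025, out_per_1k: float = 0.01) -> float:
        # Placeholder prices, not real rates.
        return (self.input_tokens / 1000) * in_per_1k \
             + (self.output_tokens / 1000) * out_per_1k

def overthinking(c: QueryCost, max_steps: int = 5, max_retries: int = 2) -> bool:
    # "Functionally correct but wasteful" queries show up as step/retry outliers.
    return c.steps > max_steps or c.retries > max_retries
```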

Efficiency vs. Quality Tradeoff

To complicate matters further, efficiency and quality often pull in opposite directions.

We tried cutting costs by reducing steps; task completion dropped 12%. We tried a cheaper model; user satisfaction dropped 18%.

In the end we adopted a tiered strategy (sketched after the list):

  • Simple questions are handled quickly by lightweight models
  • Complex questions get deep processing by strong models
  • High-value users get quality first, with low cost sensitivity
  • Ordinary users get cost optimization on top of guaranteed basic quality
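A minimal routing sketch of that tiered strategy; `is_complex()` is a hypothetical heuristic and the model names are placeholders:

```python
def is_complex(question: str) -> bool:
    # Hypothetical heuristic; in practice a small classifier works better.
    return len(question.split()) > 30

def choose_model(question: str, user_tier: str) -> str:
    STRONG, LIGHT = "strong-model", "light-model"  # substitute your own pair
    if user_tier == "high_value":
        return STRONG              # quality first, cost-insensitive
    if is_complex(question):
        return STRONG              # complex questions get deep processing
    return LIGHT                   # simple questions: fast and cheap
```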

The fourth dimension: User experience - how does the user feel when interacting with the Agent?

This is the most overlooked but most important dimension.

User Experience Map

In one project we mapped the complete user experience:

  • Discovery: how does the user learn that the Agent exists?
  • First use: how is the onboarding experience?
  • Continued use: does a usage habit form?
  • Hitting a problem: how is the error-recovery experience?
  • Getting help: how is the human-handoff experience?
  • Recommending others: is the user willing to recommend it?

Each link has key indicators:

  • Discovery → first use: conversion rate
  • First use → continued use: next-day/next-week retention
  • Continued use: frequency and depth of use
  • Hitting a problem → getting help: problem resolution rate
  • Getting help → recommending others: NPS

A counter-intuitive discovery

In a customer service agent project, our initial optimization goal was to “reduce the human handoff rate”: we assumed it was good if the Agent could solve problems on its own.

But user surveys showed that many users want to reach a human at critical moments. A fully automated experience left them feeling “unseen”.

We later adjusted the strategy: proactively offering a human option at key points, even when the Agent could finish on its own. The result: the human handoff rate went up (a “bad” metric), but user satisfaction rose significantly (a “good” metric), and business metrics ultimately improved.

**Key insight: the goal of user experience optimization is not to “eliminate humans” but to “provide the right help at the right time”.**

Fifth Dimension: Security - Is the Agent safe and controllable?

This is the dimension where things most easily go wrong, and where the consequences are most serious.

Multiple levels of security

Agent security includes multiple levels:

  • Content safety: no harmful, biased, or discriminatory content
  • Privacy safety: no disclosure of sensitive information
  • Operational safety: no dangerous actions
  • System security: no exploitation via prompt injection and similar attacks

We once had a serious privacy leak in a project. While handling one user’s query, the Agent inadvertently included another user’s personal information in its reply. The cause was a bug in context management that broke data isolation between sessions.

This incident made us realize: **an Agent’s security risk is far higher than traditional software’s, because its behavior is dynamically generated and hard to review in advance.**
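Since behavior cannot be fully reviewed in advance, a last-line output guardrail helps. A minimal sketch that scans replies for PII-like patterns before sending; the regexes are illustrative and nowhere near exhaustive (and do not replace proper session isolation):

```python
import re

# Illustrative PII patterns only; real deployments need locale-specific,
# far more thorough detection.
PII_PATTERNS = [
    re.compile(r"\b\d{11}\b"),               # phone-number-like digit runs
    re.compile(r"\b\d{17}[\dXx]\b"),         # ID-card-like digit runs
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
]

def redact_or_block(reply: str) -> str:
    for pattern in PII_PATTERNS:
        if pattern.search(reply):
            # Fail closed: a generic answer beats a privacy leak.
            return ("Sorry, part of this answer cannot be displayed. "
                    "A human agent will follow up.")
    return reply
```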


Chapter 3: LLM Judge - Hope and Disillusionment

LLM Judge’s promise

When traditional evaluation methods cannot cope with the Agent’s complexity, the LLM Judge steps in: use another LLM to evaluate the quality of the Agent’s output.

The idea is attractive:

  • Automation: No manual annotation required
  • Consistency: No fatigue, unified standards
  • Scalable: Any number of samples can be evaluated
  • Multi-dimensional: Multiple quality dimensions can be assessed simultaneously

We started our LLM Judge practice with high hopes.

The three stages of disillusionment

Stage 1: Over-optimism

We had GPT-4 serve as a judge to evaluate our customer service Agent. We provided detailed evaluation criteria, and GPT-4 returned a 1-5 score with an explanation.

Preliminary results looked good: the Judge’s ratings agreed with human ratings at 0.82 (Pearson correlation coefficient). We considered this usable.
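A minimal sketch of that setup, assuming the openai Python client and scipy; the rubric wording and model name are assumptions, and the integer parsing is deliberately naive:

```python
from openai import OpenAI
from scipy.stats import pearsonr

client = OpenAI()

JUDGE_PROMPT = (
    "You are a strict quality judge for customer-service answers. "
    "Score the answer from 1 to 5 on relevance, correctness, completeness, "
    "and tone. Reply with the integer score only."
)

def judge_score(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",  # model name is an assumption; we used GPT-4 at the time
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

def agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    # The 0.82 figure above is exactly this Pearson correlation.
    r, _ = pearsonr(judge_scores, human_scores)
    return r
```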

Stage 2: Discovering the biases

When we expanded the scope of our assessment, problems began to emerge.

We found that the Judge had systematic biases on certain types of responses:

  • Length bias: longer responses tend to score higher, even when they ramble
  • Format bias: Markdown-formatted answers score higher than plain text
  • Tone bias: a confident tone scores higher than a cautious one
  • Brand bias: answers mentioning our product score higher than answers mentioning competitors’ products (even when the latter are more helpful to the user)

These deviations are not random but systematic. They reflect biases in the Judge’s training data.

Stage 3: The calibration dilemma

After discovering the problem, we tried to calibrate the Judge.

We collected 1,000 manually labeled samples and used them to “train” the Judge’s prompt, iterating on the wording with the goal of making the Judge’s ratings as consistent as possible with human ratings.

After two weeks of effort, agreement improved to 0.88. But when we used this Judge on new samples, agreement dropped to 0.75.

Overfitting. Our Judge had “learned” the quirks of those 1,000 samples, not a general evaluation capability.

The right way to use an LLM Judge

After these setbacks, we summarized the right way to use an LLM Judge:

1. Clarify the applicable boundaries

LLM Judge is suitable for:

  • Initial screening of obviously good and bad cases
  • Reference scores for multi-dimensional evaluation
  • Flagging potential issues for human review

LLM Judge is not suitable for:

  • Serving as the sole quality criterion
  • Replacing human final review
  • Assessing highly subjective dimensions (such as “creativity”)

2. Multi-Judge voting

Don’t rely on a single Judge. We later adopted a “three-judge voting” mechanism:

  • Use three different models as judges (GPT-4, Claude, Gemini)
  • A verdict is accepted only when the three judges agree
  • Samples with large divergence go to human review

This significantly improved the reliability of the assessment, but it also tripled the cost.
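A minimal sketch of the voting rule, assuming per-model judge callables like the one above; the ±1-point agreement tolerance is an assumption:

```python
import statistics

def three_judge_verdict(question: str, answer: str, judges: dict) -> dict:
    """judges maps a model name to callable(question, answer) -> int score 1..5."""
    scores = {name: fn(question, answer) for name, fn in judges.items()}
    spread = max(scores.values()) - min(scores.values())
    if spread <= 1:  # judges roughly agree: accept the median score
        return {"accepted": True,
                "score": statistics.median(scores.values()),
                "scores": scores}
    return {"accepted": False, "needs_human_review": True, "scores": scores}
```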

3. Continuously monitor calibration

A Judge’s performance drifts over time. We established a continuous monitoring mechanism:

  • Sample 100 cases each week and re-label them manually
  • Compute the agreement between Judge scores and human scores
  • Trigger recalibration when agreement falls below a threshold
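Wired to the `agreement()` helper above, the weekly check is a few lines; the sample size and threshold mirror the list, while `human_label_fn` stands in for the manual re-labeling step:

```python
import random

def weekly_calibration_check(samples: list, judge_fn, human_label_fn,
                             n: int = 100, threshold: float = 0.8) -> bool:
    """Returns True when the Judge needs recalibration."""
    batch = random.sample(samples, n)
    judge_scores = [judge_fn(q, a) for q, a in batch]
    human_scores = [human_label_fn(q, a) for q, a in batch]  # manual re-labeling
    return agreement(judge_scores, human_scores) < threshold
```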

Chapter 4: Traps and Breakthroughs of A/B Testing

What makes A/B testing special in Agent systems

A/B testing is the gold standard for evaluating Agent improvements. But the Agent system’s non-determinism poses special challenges.

Challenge 1: Variance is too large

In A/B tests of traditional software, metric variance is relatively small. The variance of Agent metrics (such as user satisfaction), however, is often very large.

We calculated that to reach statistical significance (p < 0.05), traditional software might need 1,000 samples, while an Agent system might need 10,000. That means longer test cycles, higher costs, and greater business risk.
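The required sample size grows with the square of the metric’s standard deviation, which is what blows it up for noisy Agent metrics. A standard two-sample power approximation, with illustrative numbers:

```python
import math

def samples_per_arm(sigma: float, min_effect: float) -> int:
    """Two-sample z-approximation for alpha = 0.05 (two-sided), power = 0.8."""
    z_alpha, z_beta = 1.96, 0.84
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / min_effect ** 2)

# Tripling the metric's standard deviation multiplies the sample size by ~9:
print(samples_per_arm(sigma=0.5, min_effect=0.1))  # 392
print(samples_per_arm(sigma=1.5, min_effect=0.1))  # 3528
```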

Challenge 2: Long Tail Effect

Agent improvements often have a “long tail effect.”

We once improved a prompt, hoping the Agent’s answers would be more empathetic. A/B testing showed:

  • In the short term (within 1 week), satisfaction with the new version dropped by 5%
  • In the long term (after 4 weeks), satisfaction with the new version increased by 12%

The reason: users need time to adapt to a new interaction style. In the short term, existing users feel “unaccustomed”; in the long term, the new style wins recognition.

Had we decided based on the traditional A/B testing cycle (1-2 weeks), we would have mistakenly abandoned a version that was actually better.

Challenge 3: Interaction effects

There are complex interaction effects between an Agent’s components.

At one point we improved two components simultaneously: intent recognition and answer generation. Tested individually, each improvement lifted the metrics. Tested together, the metrics dropped.

The reason: the new intent recognition changed how questions were classified, while the new answer generation had been optimized against the old classification. The two were “incompatible”.

Best practices for A/B testing

Practice 1: Layered experiments

We divide the Agent system into layers, and run experiments at the corresponding layer:

  • Model layer: LLM selection, parameter tuning
  • Prompt layer: system prompts, role settings
  • Strategy layer: tool selection, execution order
  • Interaction layer: UI copy, interaction flow

Experiments at each layer run independently, which reduces interaction effects.

Practice 2: Multi-metric guardrails

We watch not only the primary metric (such as satisfaction) but also several guardrail metrics:

  • Cost: cost per conversation must not rise more than 10%
  • Latency: average response time must not rise more than 20%
  • Errors: the error rate must not rise

If any guardrail metric is breached, the experiment stops automatically.
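A minimal guardrail check; the thresholds mirror the list above and the field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ArmStats:
    cost_per_conv: float
    avg_latency_ms: float
    error_rate: float

def guardrails_breached(control: ArmStats, treatment: ArmStats) -> bool:
    # +10% cost, +20% latency, or any error-rate increase stops the experiment.
    return (treatment.cost_per_conv > control.cost_per_conv * 1.10
            or treatment.avg_latency_ms > control.avg_latency_ms * 1.20
            or treatment.error_rate > control.error_rate)
```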

Practice 3: Progressive rollout

Even when the experimental results are significant, we roll out progressively:

  • Week 1: 5% traffic
  • Week 2: 20% traffic
  • Week 3: 50% traffic
  • Week 4: 100% traffic

Core metrics are monitored at every stage, and we roll back promptly if problems appear.
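As a sketch, the ramp is just a schedule plus a rollback hook; the health check is a placeholder for whatever core-metric monitoring is in place:

```python
ROLLOUT_STAGES = [0.05, 0.20, 0.50, 1.00]  # weekly traffic fractions

def traffic_fraction(week: int, metrics_healthy: bool, rollback) -> float:
    """Return this week's traffic share, or roll back to 0 on regression."""
    if not metrics_healthy:
        rollback()  # revert to the old version immediately
        return 0.0
    return ROLLOUT_STAGES[min(week, len(ROLLOUT_STAGES) - 1)]
```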


4.4 Team Capacity Building

Agent quality assessment requires the support of a cross-functional team, including:

  • Data scientists: build evaluation models, analyze metric correlations, design A/B tests
  • Domain experts: define quality standards, annotate training data, review edge cases
  • Engineers: implement evaluation pipelines, monitoring systems, and automation tools
  • Product managers: define business goals, weigh quality against cost, drive improvements

These roles need to work closely together rather than in silos. We hold a weekly “quality sync meeting” where everyone shares findings, discusses issues, and coordinates actions.

4.5 Ethical considerations for evaluation

Quality assessment is not only a technical issue; it also raises ethical questions.

Sample bias: is our test set representative of the entire user population? Is it biased against certain groups?

Privacy protection: how is user data handled during evaluation? How much can a human evaluator see?

Transparency: do we tell users that their conversations are used for quality assessment? Do they have the right to opt out?

There are no standard answers to these questions, but they must be taken seriously. We include clear data-usage notices in our products and provide opt-out options.


Chapter 5: Advice for Practitioners

Checklist for getting started

If you are building, or about to build, an Agent quality evaluation system, here is a checklist distilled from my own pitfalls:

Must-have basic capabilities

  • Clearly define the criteria for “task completion” (distinguishing surface, functional, and goal completion)
  • Establish a multi-dimensional assessment framework (not just accuracy)
  • Collect and maintain real user conversation data as the test set
  • Establish a basic human evaluation process

Strongly recommended capabilities

  • Implement an LLM Judge as an initial screening tool
  • Build A/B testing capability
  • Set up production monitoring and alerting
  • Establish an error classification and analysis mechanism

Advanced abilities (depending on the scenario)

  • Automated evaluation pipeline
  • Multidimensional Quality Dashboard
  • Predictive Quality Analysis
  • Continuous learning mechanism

Common evaluation pitfalls

Trap 1: Overreliance on automated assessment

Symptom: relying entirely on an LLM Judge or automated metrics, with no human evaluation.

Consequence: systematic evaluation bias creates quality blind spots. Automated assessment captures only surface, quantifiable features; it cannot assess deep, subjective experience. For example, an automated system may decide “answer contains the keywords = correct”, while the user’s actual perception is “mechanical and inconsiderate”.

Recommendation: use automation only for initial screening; key decisions must involve humans. Our recommended split: automated assessment screens 80% of samples, human assessment does in-depth review of the other 20%.

Trap 2: The evaluation set is disconnected from the production data

Symptom: the test set contains “clean” questions, while the production environment is full of “messy” ones.

Consequence: test pass rates are high, but user satisfaction is low. The Agent performs well under ideal conditions but stumbles constantly over real users’ typos, colloquialisms, and leaps of thought.

Recommendation: regularly sample from production to update the test set and keep the distributions aligned. We recommend refreshing the test set monthly: remove stale samples and add newly discovered problem types.

Trap 3: Pursuing a single indicator

Symptoms: Focus only on satisfaction or accuracy and ignore other dimensions.

Consequence: optimizing a single metric degrades other areas. For example, over-optimizing “accuracy” can make answers overly conservative, increase the frequency of “I don’t know”, and leave users feeling the Agent is “useless”.

Recommendation: build a balanced, multi-dimensional metric system and avoid single-metric optimization. The combination we recommend: functional metrics (accuracy, completion rate), efficiency metrics (cost, latency), experience metrics (satisfaction, NPS), and safety metrics (violation rate, complaint rate).

Trap 4: Ignoring the long-term effects

Symptoms: The A/B testing cycle is too short and only looks at short-term indicators.

Consequence: missing genuinely valuable improvements, or shipping changes with long-term risks. Some improvements need a user adaptation period: short-term metrics may dip even though the change is beneficial in the long run.

Recommendation: set metrics for different horizons: short-term (1 week), mid-term (1 month), and long-term (3 months). For major changes, the recommended minimum observation period is 4 weeks.

Trap 5: Assessment is disconnected from business goals

Symptom: Assessment metrics look great, but business metrics don’t improve.

Consequence: evaluation becomes “self-entertainment” and cannot guide real improvement.

Recommendation: establish a mapping between evaluation metrics and business metrics. For example, “task completion rate” should correlate with “customer service cost”, and “user satisfaction” with “retention”. Verify these correlations regularly; if they decouple, adjust the evaluation metrics.

Trap 6: Ignoring the cost of the assessment itself

Symptoms: A complex assessment system is established but is expensive to maintain.

Consequence: evaluation becomes a burden. The team evaluates for evaluation’s sake and loses sight of its purpose.

Recommendation: match the complexity of the evaluation system to the maturity of the product. Start with simple assessments and add complexity as the system matures. Always ask: does this assessment help us make better decisions? If not, simplify it.

5.3 Evolution path of quality assessment

Based on our experience, the evolution of Agent quality assessment capabilities can be divided into four stages:

Phase 1: Manual Assessment (0-3 months)

  • Manual spot checks of conversation logs
  • Simple good/bad classification
  • Fix problems manually as they are found

The key at this stage is building quality awareness, not a perfect evaluation system.

Phase 2: Semi-automated assessment (3-6 months)

  • Build a rules engine for initial screening
  • Human review of rule-flagged samples
  • Start collecting user feedback

The key to this stage is to establish a feedback loop so that the evaluation results can guide improvements.

Phase 3: Automated Assessment (6-12 months)

  • LLM Judge goes live for automatic scoring
  • Building A/B testing capabilities
  • Multi-dimensional indicator dashboard

The key at this stage is to improve efficiency so that the evaluation can keep up with the iteration speed.

Phase 4: Smart Assessment (12 months+)

  • Predictive quality analysis (warning of problems before they occur)
  • Automatic root cause location
  • The deep connection between quality and business

The key at this stage is shifting from after-the-fact assessment to prevention, from reactive response to proactive optimization.

5.4 Suggestions for selecting assessment tools

Open-source tools:

  • OpenAI Evals: good for a quick start, with ready-made evaluation templates
  • Promptflow: from Microsoft, integrates well with the Azure ecosystem
  • Langfuse: open source with commercial support, strong value for money

Commercial tools:

  • Weights & Biases: powerful experiment tracking, well suited to research scenarios
  • Helicone: a dedicated LLM observability tool
  • LangSmith: from the LangChain team, a natural fit for LangChain projects

In-house development vs. external procurement:

  • Early on, start quickly with open-source tools
  • At maturity, consider building in-house for special needs
  • Choose evaluation tooling together with your LLM platform



Appendix: Three real quality assessment failure cases

To illustrate the complexity of quality assessment more concretely, I have selected three real cases for in-depth analysis.

Case 1: The “perfect” automated assessment system

Background: in a customer service agent project, we spent two weeks building an automated evaluation system. It used a rules engine to check whether answers contained the right keywords, met format requirements, and called the correct tools.

Symptoms: the test-set pass rate reached 92%, and we launched with confidence. But production user satisfaction was only 3.1 points, far below the expected 4.0.

Troubleshooting:

We manually reviewed 100 “passing” test samples and found serious problems:

  • 32% of answers were “correctly formatted” but empty of substance and did not solve the user’s problem
  • 18% “called the right tool” but with the wrong parameters
  • 15% were literally “correct” but so blunt in tone that users felt uncomfortable

Our automated assessment examined only “surface features”, not “actual value”.

Solution:

We then introduced tiered evaluation:

  • First tier: rule checks (quickly screen out obvious problems)
  • Second tier: semantic evaluation (an LLM Judge assesses content quality)
  • Third tier: human sampling (humans confirm key samples)

After introducing tiered evaluation, the test-set “pass rate” dropped to 78%, but production satisfaction rose to 4.2 points.
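A minimal sketch of that three-tier pipeline; `rule_check` is a stub, and `judge` stands in for an LLM Judge callable like the one sketched in Chapter 3:

```python
import random

def rule_check(answer: str) -> bool:
    # Tier 1 stub: format and keyword rules; placeholder logic only.
    return len(answer.strip()) > 0

def tiered_evaluation(question: str, answer: str, judge) -> str:
    """Returns 'fail', 'pass', or 'human_review'; judge(q, a) -> int 1..5."""
    if not rule_check(answer):               # tier 1: cheap rule screening
        return "fail"
    if judge(question, answer) < 4:          # tier 2: semantic quality check
        return "fail"
    if random.random() < 0.05:               # tier 3: ~5% go to human confirmation
        return "human_review"
    return "pass"
```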

Lesson learned: a high pass rate does not equal high quality. Automated assessment can serve only as initial screening; it cannot replace in-depth quality evaluation.

Case 2: The user experience that was “optimized” into something worse

Background: our Agent’s initial responses were fairly detailed, averaging 200 words. Users complained that “it’s too long and tiring to read”.

Optimization: we changed the prompt to require the Agent to “answer concisely, in no more than 50 words”. Tests showed answer lengths dropped and user satisfaction rose 8%.

Symptoms: two weeks after launch, the customer service team reported that “inquiry volume has surged, and many users are asking the same thing repeatedly”.

Troubleshooting:

After in-depth analysis, we found:

  • Concise answers were indeed easier to read
  • But answers to complex questions were oversimplified, and users did not get complete information
  • Users had to ask again, increasing the number of conversation turns
  • Single-round satisfaction rose, but the overall experience declined

We had optimized “single-round satisfaction” at the expense of the “end-to-end experience”.

Solution:

We adjusted our evaluation strategy:

  • Added a “first-contact resolution rate” metric
  • Added a “did the user repeat the same question” metric
  • Allowed the Agent to adjust answer length dynamically based on question complexity

After the change, although average answer length rose to 120 words, the repeat-inquiry rate dropped 40%, and overall satisfaction was higher.

**Lesson learned: local optimization can cause global deterioration. Quality assessment must look at the end-to-end experience, not a single metric.**

Case 3: The “stable” Agent that suddenly lost control

Background: our Agent had run stably for three months, with all monitoring indicators normal.

Symptoms: one afternoon, the Agent suddenly began producing large numbers of wrong answers. Monitoring still showed a 95% “success rate”, but user complaints surged.

Troubleshooting:

The investigation found:

  • That afternoon, the backend database underwent an unplanned schema change
  • The Agent’s tool calls still “succeeded” (HTTP 200), but the returned data structure had changed
  • The Agent generated answers from the wrong data, while the monitoring system treated “call succeeded” as “task succeeded”

Our monitoring covered only “tool calls”, not “result correctness”.

Solution:

We strengthened the monitoring system:

  • Monitor data schema compatibility
  • Monitor business metrics for anomalies (not just technical metrics)
  • Establish end-to-end health checks that regularly verify critical paths
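A minimal schema-compatibility check, assuming a declared contract of required fields per tool (the tool name and fields are hypothetical):

```python
REQUIRED_FIELDS = {
    "order_status_tool": {"order_id", "status", "updated_at"},
}

def schema_compatible(tool_name: str, payload: dict) -> bool:
    """HTTP 200 is not enough: verify the response still carries the fields
    the Agent's answer generation depends on."""
    required = REQUIRED_FIELDS.get(tool_name, set())
    return required.issubset(payload.keys())

# In the tool-call wrapper: alert and degrade gracefully on incompatibility,
# instead of letting the Agent answer from a silently changed structure.
```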

At the same time, we established an evaluation process for database changes: any backend change must be assessed for its impact on the Agent.

**Lesson learned: monitoring must cover not only “is the system running” but also “is the system running correctly”. Normal technical metrics do not imply normal business metrics.**


Conclusion: Quality is the basis of trust

Returning to the failure case at the beginning of this article: if we had had a truly complete quality evaluation system:

  • It would have identified the dialect-understanding blind spot during testing
  • It would have caught the edge-scenario issues in A/B testing
  • It would have detected the anomalous trend promptly in production monitoring
  • It would have triggered alerts and automatic degradation before the problem escalated

The outcome might have been completely different.

**Agent quality assessment matters because it determines whether we can trust an AI system to serve real users in production.**

An Agent without evaluation is like a car without brakes: it may run fast, but it can lose control at any moment. An Agent with a flawed evaluation is like a faulty brake: it gives a false sense of security, which is even more dangerous. **Only an Agent with a complete quality evaluation system can become a truly reliable, trustworthy intelligent partner.**

This is my biggest realization in more than two years of practice, and it is also the core message I want to convey in this article.

Today, as AI capabilities develop rapidly, we often focus too much on “what it can do” and ignore “how well it does it”. But only the latter truly determines the success or failure of AI products.

**Quality is not an add-on to functionality; it is the cornerstone of trust.**


Reference resources

Original: Kaggle white paper “Agent Quality”

Related reading:

  • Evaluating LLMs as Judges
  • Human Evaluation of Conversational AI


*This article is an original practice summary, written based on personal project experience.*

Last updated: 2026-03-12
