Original analysis: the essential challenge of observability in Agent production environments
An in-depth look at the fundamental differences between Agents and traditional software, and why traditional monitoring methods fail in the AI era
📋 Copyright Statement and Disclaimer
This article is an original analysis based on the author's personal practical experience, inspired by a technical blog post from the LangChain Team.
Opinion attribution:
- All specific cases, practical data, and team-organization experience in this article come from the author's own projects
- The core methodology and framework are the author's own reconstruction and thinking
- The core insight of the original title (the unpredictability of production environments) is quoted only in the introduction
Original reference:
- Title: “You don’t know what your agent will do until it’s in production”
- Author: LangChain Team
- Link: Read original text
Nature of this article: an independently written practice summary, not a translation or adaptation. The views expressed represent only the author's personal understanding and may differ from the original author's position.
Introduction: A production accident that kept me awake all night
It was early on a Tuesday morning in late autumn last year. A PagerDuty alert went off on my phone: something was wrong with our smart document assistant Agent.
I logged into the system to check. All metrics looked normal: API response time was stable at around 300ms, the error rate held at 0.05%, and there were no infrastructure alarms from the cloud provider. Yet the customer service ticket system was flooded with user complaints: "The AI's answer completely missed my question", "It keeps repeating the same nonsense", "The document clearly has the answer but the AI says it can't find it".
After two hours of investigation, I found the truth: the schema of an upstream data source had changed slightly; a field had been renamed from content to body. The change triggered no traditional monitoring alerts, because HTTP requests still returned 200 and the API never timed out. But the Agent's RAG retrieval logic broke, every user query was incorrectly handled as "relevant information not found", and the Agent started improvising answers from common templates in its training data.
At that moment I realized: **traditional monitoring and Agent monitoring are two completely different species.**
Part One: Three Cognitive Traps in Agent Monitoring
Trap 1: Equating "running normally" with "correct output"
The core assumption of traditional software monitoring is that if the system reports no errors, it is working normally. This assumption holds in deterministic systems, but it fails for Agents.
I once ran an internal experiment: 500 user conversations were randomly sampled, 127 of which the system had marked as "successful" (HTTP 200, no exception thrown), yet manual review found that 43% actually had problems - some contained hallucinated information, some cited the wrong sources, and some dropped key constraints.
This means: **an Agent's health cannot be inferred from external metrics; the semantic content must be inspected directly.**
Trap 2: Underestimating the scope of uncertainty
Most developers know that LLMs are non-deterministic (at temperature > 0), but they often underestimate the actual impact of that non-determinism.
In real data from our system, the same user query (the exact same input after normalized preprocessing) was submitted 37 times within 24 hours. Among those 37 responses:
- 28 times (75.7%): the answer was correct and consistent
- 5 times (13.5%): the answer was correct but phrased differently
- 4 times (10.8%): the answer contained errors of varying severity - 2 minor deviations and 2 completely wrong conclusions
This volatility means you cannot "fix it once and the problem goes away" as with traditional software. Agent problems are statistical in nature and must be controlled through continuous monitoring and iterative optimization.
Trap 3: Confusing “high coverage” with “controllable risk”
The core idea of traditional software testing is that covering 80% of code paths significantly reduces production defects.
Agents break this logic. Because the combinatorial space of user inputs grows exponentially, the concept of "coverage" itself loses its meaning. An expression that never appears in your test set may show up hundreds of times a day among real users. One real case: users wrote "Help me get this done" to request a document summary. That colloquial phrasing did not exist anywhere in our 3,000 test cases, yet it appeared 23 times in the first week after launch.
**Conclusion: Agent monitoring cannot rely on predefined test sets and must turn to real-time analysis in the production environment.**
Part Two: Building a practical framework for the Agent monitoring system
Based on project experience over the past two years, I have developed a layered approach to monitoring. This framework was not deliberately designed as a tidy three-layer structure; it emerged naturally in practice as a systematic way to handle common challenges in Agent observability. It echoes dimensions discussed across the industry, such as system reliability, semantic correctness, and human verification - an echo that stems from the shared nature of the problem, not from structurally imitating any particular article.
System-layer monitoring (keep the traditional tooling, but reduce its weight)
This layer relies on traditional APM tools. The metrics covered include:
- API response time P50/P95/P99
- HTTP error rate (4xx/5xx)
- Token consumption trend (to catch abnormal consumption such as potential loops or runaway generation; see the sketch at the end of this subsection)
- Upstream dependency health
Key mindset shift: the role of this layer is downgraded from "quality determination" to "fault discovery". It can only tell you whether the Agent is still alive, not how well it is living.
In our practice, the alert thresholds for this layer are set quite loosely, because we found that over-reliance on these metrics creates a "crying wolf" effect: a flood of low-value alerts drowns out the issues that really need attention.
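As a concrete illustration of the token-consumption check mentioned in the metric list above, here is a minimal sketch. The 3x-trailing-average threshold, the window size, and the warm-up count are illustrative assumptions, not the thresholds we run in production.

```python
# A minimal sketch of a per-conversation token-consumption anomaly check.
# Threshold factor, window size, and warm-up count are illustrative assumptions.
from collections import deque

class TokenAnomalyDetector:
    def __init__(self, window: int = 1000, factor: float = 3.0, warmup: int = 50):
        self.recent = deque(maxlen=window)  # trailing per-conversation token counts
        self.factor = factor
        self.warmup = warmup

    def observe(self, total_tokens: int) -> bool:
        """Record one conversation's token usage; return True if it looks anomalous."""
        is_anomaly = False
        if len(self.recent) >= self.warmup:
            avg = sum(self.recent) / len(self.recent)
            is_anomaly = total_tokens > self.factor * avg
        self.recent.append(total_tokens)
        return is_anomaly
```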
Semantic-layer monitoring (the core battlefield)
This is the layer that truly differentiates Agent monitoring, and it is where we invest the most energy.
2.1 Complete recording of conversation traces
We built a complete trace storage system. Each record includes:
- Raw user input (without any preprocessing)
- The system prompt (the full version in effect at the time)
- RAG retrieval results (top-K documents and similarity scores)
- Intermediate reasoning steps (if the model supports reasoning output)
- Tool call chain (call order, parameters, return values)
- Final output
- User feedback (explicit feedback such as likes/dislikes, implicit feedback such as whether the query was resent)
This data is not plain log text; it is stored in a structured form that supports multi-dimensional queries. For example, I can quickly pull up "all conversations where the RAG recall similarity was below 0.6 but the final output was still approved by the user" in order to tune the retrieval threshold. A minimal sketch of such a trace record and query is shown below.
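The following is a minimal sketch of a structured trace record and one multi-dimensional query, assuming a simple in-memory model. Field names such as rag_results and user_feedback are illustrative, not our actual schema.

```python
# A minimal sketch of a structured conversation trace and a multi-dimensional query.
# Field names are illustrative assumptions, not the production schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RetrievedDoc:
    doc_id: str
    similarity: float                      # similarity score from the retriever

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str

@dataclass
class ConversationTrace:
    trace_id: str
    raw_input: str                         # user input, no preprocessing applied
    system_prompt: str                     # the full prompt in effect at the time
    rag_results: list[RetrievedDoc] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
    user_feedback: Optional[str] = None    # "like", "dislike", or None

def low_similarity_but_liked(traces: list[ConversationTrace], threshold: float = 0.6):
    """Conversations whose best RAG hit was weak yet the user still approved the answer."""
    return [
        t for t in traces
        if t.rag_results
        and max(d.similarity for d in t.rag_results) < threshold
        and t.user_feedback == "like"
    ]
```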
2.2 Automated quality assessment
We established an automated evaluation pipeline using the LLM-as-a-judge pattern:
production traffic → sampling → evaluator LLM scoring → low-scoring samples enter the human review queue
Assessment dimensions include:
- Fidelity (whether the output is based on retrieved documents rather than model knowledge)
- Completeness (whether all sub-questions of the user's question are covered)
- Safety (whether it contains sensitive-information leaks or harmful content)
- Relevance (whether it actually answers the user's question rather than answering around it)
Important lesson: the evaluator must be kept in version lockstep with the production model. We once had an evaluator that was two versions behind, which caused a serious disconnect between evaluation results and the actual user experience.
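Below is a minimal sketch of the sampling-and-scoring step of such a pipeline, reusing the ConversationTrace shape from the earlier sketch. The prompt wording, the 10% sample rate, the 3-point review threshold, and the canned call_llm placeholder are illustrative assumptions; swap in your own model client.

```python
# A minimal sketch of the LLM-as-a-judge step: sample traffic, score it,
# and flag low-scoring samples for the human review queue.
import json
import random

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Retrieved documents: {documents}
Answer: {answer}

Score each dimension from 1 to 5 and reply as JSON:
{{"fidelity": 0, "completeness": 0, "safety": 0, "relevance": 0}}"""

def call_llm(prompt: str) -> str:
    # Placeholder: plug in your actual model client here.
    # A canned response keeps the sketch runnable.
    return '{"fidelity": 5, "completeness": 4, "safety": 5, "relevance": 4}'

def judge_sample(trace, sample_rate: float = 0.1, review_threshold: int = 3):
    """Return an evaluation record, or None if this trace was not sampled."""
    if random.random() > sample_rate:
        return None
    prompt = JUDGE_PROMPT.format(
        question=trace.raw_input,
        documents=[d.doc_id for d in trace.rag_results],
        answer=trace.final_output,
    )
    scores = json.loads(call_llm(prompt))
    return {
        "trace_id": trace.trace_id,
        "scores": scores,
        "needs_human_review": min(scores.values()) < review_threshold,
    }
```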
2.3 Cost-quality trade-off analysis
This is an easily overlooked but extremely important dimension. We compute a cost for each conversation (a small sketch follows the list):
- Input tokens × input unit price
- Output tokens × output unit price
- Tool call costs (such as search engine API fees)
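Here is a minimal sketch of that per-conversation cost accounting and of joining cost with quality scores. The unit prices, record field names, and the 3x/3-point outlier cutoffs are illustrative assumptions.

```python
# A minimal sketch of per-conversation cost accounting.
# Unit prices and outlier cutoffs are illustrative assumptions.
INPUT_PRICE_PER_1K = 0.003     # USD per 1K input tokens (assumed)
OUTPUT_PRICE_PER_1K = 0.006    # USD per 1K output tokens (assumed)

def conversation_cost(input_tokens: int, output_tokens: int,
                      tool_fees_usd: float = 0.0) -> float:
    """Token cost plus external tool fees (e.g. search API calls)."""
    token_cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K \
               + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return round(token_cost + tool_fees_usd, 6)

def cost_quality_outliers(records, cost_factor: float = 3.0, score_cutoff: int = 3):
    """Find expensive-but-low-quality conversations; records carry 'cost' and 'score'."""
    avg_cost = sum(r["cost"] for r in records) / len(records)
    return [r for r in records
            if r["cost"] > cost_factor * avg_cost and r["score"] < score_cutoff]
```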
By correlating costs with quality scores, we discovered something counterintuitive:
- The most expensive conversations are not necessarily the highest quality (sometimes the model is looping on meaningless content)
- Some "normal-looking" conversations have extremely low value per token (long but uninformative answers)
Based on these analyses, we optimized the prompts, reducing the average cost per interaction from $0.042 to $0.031 while raising the user satisfaction score from 4.1 to 4.3.
Human review closed loop (the last line of defense for quality)
No matter how good automated evaluation gets, it is still no substitute for human review. The question is how to maximize review value under resource constraints.
Our solution: a smart sampling queue.
Not every conversation needs human review. We established a multi-dimensional sampling strategy:
| Sampling dimension | Weight | Notes |
|---|---|---|
| Explicit negative user feedback | 100% | Must review |
| Low automated evaluation score | 80% | Samples with evaluation scores below 3 |
| First week after a new feature or model launches | 50% | Additional sampling ratio |
| Abnormal cost | 30% | Cost more than 3x the average |
| Random sampling | 5% | Maintains baseline monitoring |
This strategy keeps the human review workload manageable (roughly 2 FTEs can cover a system averaging 5,000 interactions per day) while ensuring high coverage of high-risk samples. A sketch of the sampling decision is shown below.
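The following is a minimal sketch of how the table above can be turned into a sampling decision. The trace is treated as a plain dict here, and all field names are illustrative assumptions.

```python
# A minimal sketch of the review-queue sampling decision, mirroring the table above.
# Trace field names are illustrative assumptions.
import random

def review_sampling_probability(trace: dict, avg_cost: float,
                                in_new_feature_window: bool = False) -> float:
    """Probability that this conversation enters the human review queue."""
    if trace.get("explicit_negative_feedback"):
        return 1.0                      # explicit negative feedback: always review
    if trace.get("judge_score", 5) < 3:
        return 0.8                      # low automated evaluation score
    if in_new_feature_window:
        return 0.5                      # first week of a new feature or model
    if trace.get("cost", 0.0) > 3 * avg_cost:
        return 0.3                      # abnormal cost
    return 0.05                         # baseline random sampling

def should_review(trace: dict, avg_cost: float, **kwargs) -> bool:
    return random.random() < review_sampling_probability(trace, avg_cost, **kwargs)
```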
Design principles for review tools:
- Reviewers do not need to understand the technical implementation; they only need to judge whether the output is useful
- Every review sample must be linked to the original documents and retrieval results, so hallucinations can be spotted
- Review results must be convertible into training data or test cases with one click
Thoughts on Uncertainty
I once read this sentence: "You don't know what your agent will do until it's in production." It captures the most essential challenge of Agent monitoring: an Agent's value lies precisely in its ability to handle an input space that developers cannot enumerate. If we could fully predict everything the Agent would do, it would not be a real Agent; it would just be an elaborate set of conditional statements.
This means: **the real challenge is not to eliminate uncertainty, but to build the ability to live with it.**
Part Three: Traps encountered and lessons learned
The theoretical framework is easy to explain, but the complexity of the real world often exceeds expectations. Here are a few key lessons I learned firsthand.
Lesson One: The Hard Road to Evaluator Calibration
We initially used GPT-4 as the evaluator, asking it to rate outputs on a 1-5 scale with generic prompts. The first two weeks seemed to go well: evaluation scores showed a 0.7+ correlation with user satisfaction.
Problems surfaced in the third week. A group of users reported that "the AI is talking nonsense", yet the evaluator had given those samples scores of 4 or higher. Going through them one by one, I found the evaluator was being fooled by surface quality: the format was tidy, the grammar was correct, and the tone was friendly, but the core facts were completely wrong.
Our response:
- Introduce a RAG-specific "fidelity" check that forces the evaluator to compare the output against the retrieved source documents
- Establish domain-specific evaluation criteria (we have compiled 47 of them)
- Select 5% of evaluated samples each week for human review and compute the agreement rate between the evaluator and human judges
The results were sobering: even after optimization, the evaluator agreed with human judgment only 82% of the time. The remaining 18% still requires human verification.
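For reference, here is a minimal sketch of that weekly agreement check. The 5% sample rate and the pass/fail mapping at a 3-point cutoff are illustrative assumptions.

```python
# A minimal sketch of the weekly evaluator-vs-human agreement check.
# Sample rate and the 3-point pass/fail cutoff are illustrative assumptions.
import random

def weekly_agreement(scored_traces, human_score_fn, sample_rate: float = 0.05) -> float:
    """scored_traces: iterable of (trace, judge_score); human_score_fn returns a 1-5 score."""
    sample = [item for item in scored_traces if random.random() < sample_rate]
    if not sample:
        return float("nan")
    agree = sum(
        1 for trace, judge_score in sample
        if (judge_score >= 3) == (human_score_fn(trace) >= 3)  # agree on pass/fail
    )
    return agree / len(sample)
```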
Lesson Two: The pain of adjusting team organization
Agent monitoring is not a purely technical problem; it is also an organizational one.
Initially, monitoring was handled by our SRE team, who had mature APM experience. But three months in, I found a fatal gap: SREs focus on whether the system is available, and when the Agent is available but producing low-quality output, they have no basis for judging it.
We ended up adjusting our team structure:
- SRE team: continues to own system-layer monitoring (the first layer)
- AI product team: owns semantic-layer monitoring strategy and human review (the second and third layers)
- Data science team: owns evaluator optimization and quality-model iteration
This adjustment did not happen overnight. The biggest resistance came from blurred responsibility boundaries: when something went wrong, arguments over "is this a system problem or a quality problem" were common. It took us a quarter to establish clear SLI/SLO definitions.
Lesson Three: The Pitfalls of Over-Monitoring
Monitoring can also be excessive. We once tried to record everything, including per-token attention weights (where the model exposed them). The result:
- Storage costs skyrocketed ($800+ per month)
- Query performance slowed to a crawl (analyzing a single conversation took several minutes)
- The team was drowning in data and struggled to locate the issues that mattered
Final approach: keep only the data that is necessary for post-mortem diagnosis and drop the data that is merely "potentially useful". Specifically:
- Retain: original input, system prompt, retrieval results, final output, tool-call records
- Discard: intermediate-layer activations, attention heat maps, detailed token-level logs
This trade-off reduces storage costs by 70% with almost no loss in core diagnostic capabilities.
Lesson Four: Monitor the monitoring itself
An easily overlooked question: who monitors the monitoring?
We have hit two failures caused by monitoring blind spots:
- The evaluator service went down, every conversation was marked "not evaluated", and no alert was raised
- The trace-storage sampling rate was accidentally set to 0%, resulting in three days of lost data
We have since established a "meta-monitoring" layer:
- Monitor evaluator health (evaluation latency, evaluation failure rate)
- Monitor trace-storage integrity (verify hourly that the stored volume matches expectations)
- Monitor sampling-rate configuration changes (any change notifies the team immediately)
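A minimal sketch of these three checks follows. The thresholds, the expected-volume heuristic, and the alert() stand-in are illustrative assumptions; wire them into your own paging system.

```python
# A minimal sketch of the meta-monitoring checks listed above.
# Thresholds and the alert() stand-in are illustrative assumptions.
from datetime import datetime, timezone

def alert(message: str) -> None:
    print(f"[META-MONITOR] {datetime.now(timezone.utc).isoformat()} {message}")

def check_evaluator_health(p95_latency_s: float, failure_rate: float) -> None:
    if p95_latency_s > 30 or failure_rate > 0.05:
        alert(f"evaluator degraded: p95={p95_latency_s}s, failures={failure_rate:.1%}")

def check_trace_volume(stored_last_hour: int, expected_per_hour: int) -> None:
    # Far fewer traces than expected usually means sampling or storage is broken.
    if stored_last_hour < 0.5 * expected_per_hour:
        alert(f"trace volume {stored_last_hour} is below 50% of expected {expected_per_hour}")

def check_sampling_config(current_rate: float, last_known_rate: float) -> None:
    if current_rate != last_known_rate:
        alert(f"sampling rate changed: {last_known_rate} -> {current_rate}")
```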
Part Four: Practical considerations for tool selection
What tools do you need to monitor an Agent? There is no standard answer, but there are some common considerations.
Self-built vs. third party
We evaluated the mainstream Agent monitoring tools on the market (LangSmith, Langfuse, Phoenix, etc.) and finally chose a hybrid solution:
Third-party tools are used for:
- Rapid prototyping and debugging
- Trace visualization during development
- Small-scale experiment tracking
The self-built system is used for:
- Long-term storage for production environments (data privacy compliance requirements)
- Deep integration with internal evaluation pipelines
- Custom sampling strategies and manual review workflows
This choice is based on our specific constraints: data cannot leave the VPC, and there is an existing mature MLOps infrastructure that needs to be integrated.
Key feature checklist
Whichever tool you choose, the following features are, in my view, must-haves:
- Structured trace storage: multi-turn conversations stored as JSON/Protobuf rather than plain-text logs
- Multi-dimensional search: filtering by time, user ID, intent category, tool-call type, and other dimensions
- Evaluation integration: programmatic writing of evaluation scores and evaluator version management
- Sampling configuration: flexibly adjustable sampling rates (by traffic percentage, user attributes, and specific patterns)
- Privacy compliance: sensitive-data masking, data retention policies, and access control
A counter-intuitive discovery
When evaluating tools, we found that feature richness is not positively correlated with practical value.
Some tools offer gorgeous visualizations: attention heat maps, token probability distributions, embedding-space projections. They look cool, but we almost never use them in real troubleshooting. The features that actually get used constantly are very simple:
- Quickly view full conversation context
- Compare the output differences between different versions of the model
- Export specific samples for debugging
This discovery influenced our tool selection: we prioritized “data accessibility” over “visual coolness.”
Part Five: Practical Suggestions and Summary
Advice for teams just starting out
If you are preparing to deploy your first agent into production, here are some suggestions for prioritization:
P0 (required before launch):
- Establish complete trace storage - even a simple JSON log is better than nothing (a minimal example follows this list)
- Add a user feedback mechanism - even just a like/dislike button
- Set cost alerts to prevent runaway token consumption
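In the spirit of "a simple JSON log is better than nothing", here is a minimal sketch of an append-only JSON-lines trace log. The file path and field names are illustrative assumptions.

```python
# A minimal sketch of a P0 trace log: one JSON line per conversation.
# File path and field names are illustrative assumptions.
import json
import time
import uuid

def log_trace(raw_input: str, system_prompt: str, retrieved_docs: list,
              final_output: str, path: str = "agent_traces.jsonl") -> str:
    """Append one conversation trace as a JSON line and return its id."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "raw_input": raw_input,
        "system_prompt": system_prompt,
        "retrieved_docs": retrieved_docs,
        "final_output": final_output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record["trace_id"]
```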
P1 (within one month after launch):
- Establish an automated evaluation pipeline - at minimum able to flag obvious errors
- Build a human review workflow - clarify who reviews, what is reviewed, and how the results are used
- Establish model version tracking - record the timing and scope of impact of every model change
P2 (continuous optimization):
- Improve evaluator accuracy - iterate continuously based on human annotations
- Build a quality analysis dashboard - let the team see trends at a glance
- Close the feedback loop - ensure review results translate into model improvements
Core cognitive framework
After two years of practice, I have formed the following core understanding of Agent monitoring:
1. Monitoring is learning. The purpose of Agent monitoring is not to "prevent errors" but to "learn quickly". Every anomaly exposed in production is an opportunity to improve the model. The value of the monitoring system lies in systematizing and scaling this learning process.
2. Quality is probabilistic. Don't pursue "zero defects"; pursue a "controllable defect rate". Adopt statistical thinking: focus on distributions rather than individual samples, and on trends rather than single data points.
3. Human-machine collaboration is necessary. Don't try to replace human labor entirely with automation. With current technology, human judgment is still indispensable for quality assessment. The key is a sensible division of labor: automation handles scale, while humans focus on edge cases and high-value samples.
4. Monitoring is part of the product. Agent monitoring is not the exclusive tool of the operations team; it is a core working interface for the product team. Product managers should be able to use the monitoring system to understand user behavior, discover product opportunities, and validate feature assumptions.
Final thoughts
More than two years of Agent monitoring practice have given me a more personal understanding of this work.
The first is accepting the feeling of being out of control. Traditional software engineers are used to a sense of control: code executes as expected, tests cover every branch, and results can be predicted before launch. Agents break that comfort zone. I once hesitated to widen a gradual rollout because I worried that "I don't know how the Agent will answer." Later I realized that this worry was itself a signal of a cognitive shift. Learning to live with uncertainty may be the core soft skill for engineers in the AI era.
The second is building trust in the team. Monitoring data can easily become an accountability weapon, and that is the tendency I am most wary of. I set a clear rule within the team: monitoring is for finding problems and improving the system, not for evaluating individuals. The point of this rule is to make everyone willing to surface the problems monitoring exposes rather than cover them up. Only when the team believes that reporting problems carries no negative consequences can the monitoring system truly work.
Finally, a look ahead. I believe the ultimate form of Agent monitoring is not "more comprehensive data collection" but "a smarter feedback loop". Imagine a monitoring system that can not only tell you what went wrong, but also automatically diagnose why it went wrong, proactively suggest how to fix it, and verify the fix. That vision is still far from reality, but the direction is clear.
The essence of Agent monitoring is to help us shift from “trying to predict everything” to “maintaining understanding and control amid uncertainty.” This may be the most important meta-capability of AI-native application development - and we have only just started on this road.
References and Acknowledgments
This article was inspired by the LangChain Team's technical blog post "You don't know what your agent will do until it's in production". The title alone captures the core challenge of observability in Agent production environments, and that insight prompted this article.
Original information:
- Title: “You don’t know what your agent will do until it’s in production”
- Author: LangChain Team
- Link: Read original text
About this article:
- All cases, data, and practical frameworks in this article are the author's own summaries based on personal project experience
- The core methodology (the three-layer monitoring framework, the smart sampling strategy, etc.) was designed independently by the author
- Where views resemble the original text, this reflects natural convergence on industry consensus rather than direct quotation
Related Tools:
- LangSmith - Agent Observability Platform
- Langfuse - Open source LLM engineering platform
- Arize Phoenix - AI Observability and Assessment Tool