A truly mature Eval Harness will not just focus on the answer
If an eval harness can only tell you whether a task succeeded or failed, but cannot explain whether the agent called the correct capabilities, in what environment it executed, why it failed, or why it succeeded, then what it gives you is not a systematic judgment but just a scorecard. This article starts from LangChain's discussion of skill evals and extends it into my complete view of artifact-based scoring, invocation metrics, trace design, workflow evals, and evaluation layering.
Original reference: Robert Xu, “Evaluating Skills”. This article is an original interpretation, not a translation.
Almost no team building agents today fails to talk about evals. Everyone knows evaluation matters; the words evaluation, benchmark, judge, trace, dataset, and offline/online loop appear in almost every product roadmap. But once you get down to the execution layer, you quickly discover a somewhat embarrassing fact: **many so-called eval harnesses are actually just "automatic scoring shells" and never really enter the black box of the agent system.**
They do tell you a few things: what the success rate is, which set of prompts is better, which version scores higher, which model is currently leading.
But what they most often lack is exactly what the engineering team needs most: did the agent call the correct capabilities? Did it succeed because the skill actually worked, or because it happened to bypass it? If it failed, was the skill unused, misused, or was something wrong with the environment? Why did it succeed this time and fail the next? Is this evaluation testing the model, the workflow, the skill, or environmental noise?
The reason LangChain's "Evaluating Skills" article is worth reading carefully is not that it offers a complex evaluation theory, but that it captures a very critical and very realistic point: **once an agent's capabilities come from dynamically loaded skills, a runtime environment, an execution trajectory, and tool orchestration, looking only at the final answer will almost certainly misjudge the system.**
In my opinion, what this article really argues for is not the narrow question of "how to evaluate a skill", but a larger shift: **the evaluation object of an agent is no longer just the output, but a complete execution system.**
1. Why output-only eval breaks down quickly
In past evaluation paradigms, output orientation was almost natural. The reason is simple: many traditional model tasks are inherently input-output mappings. Summarize a text, answer a question, translate a sentence, classify a picture.
Looking only at results is reasonable there: the process is not part of the system's explicit value, and the process itself is hard to observe stably anyway.
But agents are different. An agent does not just "generate a result"; it acts in an environment: reading files, invoking tools, triggering skills, writing scripts, executing commands, retrying, choosing paths, and transitioning state across multiple stages.
**Therefore, an agent's result is not just the "final answer", but the execution path through which it organizes its capabilities into that answer.**
Because of this, output-only eval will soon encounter several fundamental problems.
1. The same results may come from completely different system behaviors.
Just because a task is completed does not mean the skill was effective. Perhaps the agent never used the skill at all, or mistakenly used another skill and completed the task by chance, or the task was so simple the skill did not matter, or hidden clues in the environment let the agent bypass the key capability.
If you only look at pass/fail, all of these cases collapse into "done".
2. A failure may not be a skill failure.
If a task fails, it does not necessarily mean the skill is badly designed. It may be that the skill was not retrieved correctly, that its name, description, or trigger conditions were poorly designed, that environmental pollution led the agent astray, that a mid-trace failure was later covered up by other noise, or that the task itself was judged in an unreasonable way.
Output-only eval compresses these completely different failure causes into a single red cross, which has little operational value for engineering.
3. The team gets held hostage by "cheap metrics"
The most dangerous part is that output-only eval fits organizational inertia perfectly: it is simple, looks good, is easy to put on dashboards, easy to report, and easiest to satisfy managers with.
But cheap does not mean right.
I increasingly feel that output-only eval is popular not because it is good enough, but because it is the least troublesome. The problem is that once a team really uses it to guide optimization, it easily falls into a very expensive illusion: **the score keeps moving, but the system never actually becomes clearer.**
2. Why skill evaluation is one of the best entry points into agent eval
The value of "Evaluating Skills" lies in choosing a particularly good entry point: skills.
Why are skills a good entry point? Because they naturally force you to face the full complexity of agent eval.
A skill is not an isolated prompt. Its life cycle involves a complete capability chain: the triggering mechanism (whether it is recalled correctly and when it activates), the understanding process (whether the agent interprets its content accurately), the execution effect (whether it truly changes the behavior trajectory and makes the agent more efficient or more stable), and finally conflicts (how well it coexists with other skills).
In other words, testing a skill is actually testing:
whether a method of capability injection is triggered, understood, and executed in the right way in a real agent environment, and whether it truly changes the task outcome.
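To make that concrete, here is a minimal sketch of what a per-skill evaluation record might track across that lifecycle. The field names are my own illustration, not an API from the article or from LangChain.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SkillEvalRecord:
    """Hypothetical per-skill record covering the lifecycle above."""
    skill_name: str
    triggered: bool                    # was the skill recalled/activated at all?
    trigger_step: Optional[int]        # at which step of the trajectory it fired
    interpreted_correctly: bool        # did the agent read its content as intended?
    changed_trajectory: bool           # did behavior differ from a no-skill baseline?
    task_passed: bool                  # final task outcome
    conflicts: list[str] = field(default_factory=list)  # skills it interfered with
```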
3. A clean sandbox is not an engineering obsession; it is a prerequisite for a credible evaluation
The article places special emphasis on the clean sandbox. I strongly agree, and I think its importance can be raised even higher.
Because the biggest difference between agent eval and many traditional benchmarks is that **the environment is a real experimental variable.**
The same agent, the same skill, and the same task can produce completely different results if there is even a small residual difference in the environment. These differences include, but are not limited to: artifacts from the previous run left in the directory, local caches that let dependency installation skip steps, inconsistent file-system state, different tool visibility, default configurations modified in an earlier round, and hidden fixtures that let the agent take fewer steps.
These things cause problems for human developers too, but they hit agents harder, because an agent easily becomes over-reliant on accidental cues in the environment; once those cues change, results become unstable.
Therefore, a reliable eval harness should at least ensure that the initial state can be reconstructed, the dependency environment is controlled, the input artifacts are reproducible, the output paths can be inspected, and the execution trajectory can be replayed.
If these conditions are not met, then the so-called "comparison of different agent versions" is essentially running on different muddy floors and pretending to compare shoes.
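As a rough illustration of what "the initial state can be reconstructed" means in practice, here is a minimal sketch that rebuilds a throwaway workspace from a fixed fixture before each run. The `agent.run(...)` interface and `fixture_dir` layout are assumptions for the example, not something prescribed by the original article.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_in_clean_sandbox(task, agent, fixture_dir: Path):
    """Rebuild the initial state from a fixed fixture before every run,
    so residue from a previous run cannot leak into this one."""
    workdir = Path(tempfile.mkdtemp(prefix="eval-"))
    try:
        # Reproducible inputs: copy the pristine fixture, never reuse a dirty directory.
        shutil.copytree(fixture_dir, workdir, dirs_exist_ok=True)
        # Controlled dependencies: an isolated venv instead of whatever is cached locally.
        subprocess.run(["python", "-m", "venv", str(workdir / ".venv")], check=True)
        result = agent.run(task, cwd=workdir)   # hypothetical agent interface
        artifacts = sorted(p for p in workdir.rglob("*") if p.is_file())
        return result, artifacts                # keep output paths for inspection
    finally:
        shutil.rmtree(workdir, ignore_errors=True)  # leave nothing for the next run
```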
4. Artifact-based scoring: from "it looks right" to "the system actually produced verifiable results"
The point in this article that I most want engineering teams to adopt is artifact-based scoring.
Why is it so critical? Because once agent tasks become even slightly open-ended, the traditional "compare against a reference answer" approach quickly stops working.
For example, in a coding task the correct result is not necessarily a fixed piece of text. It may be that a certain file was created, a certain test passes, a certain schema satisfies its constraints, a certain evaluator was written, a certain dataset was uploaded correctly, or a certain script can actually be executed.
In these scenarios, the most reliable scoring method is not "does it look like the reference answer" but "did it produce verifiable artifacts?"
It also naturally fits how agents actually work: an agent does not just give a reply, it leaves behind code, files, configurations, logs, test status, and structured output. Evaluation thus shifts from "subjectively judging output quality" to "checking whether external facts hold."
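A minimal sketch of what such checks might look like for a coding task, assuming the agent worked inside a sandbox directory; the specific file names and checks are illustrative, not taken from the article.

```python
import json
import subprocess
from pathlib import Path

def score_artifacts(workdir: Path) -> dict[str, bool]:
    """Score by checking external facts the run should have produced,
    not by comparing the final message against a reference answer."""
    checks: dict[str, bool] = {}
    # Fact 1: the expected file was actually created.
    checks["report_created"] = (workdir / "report.md").exists()
    # Fact 2: the test suite passes inside the sandbox.
    proc = subprocess.run(["pytest", "-q"], cwd=workdir, capture_output=True)
    checks["tests_pass"] = proc.returncode == 0
    # Fact 3: the produced config parses and satisfies a simple schema constraint.
    try:
        cfg = json.loads((workdir / "config.json").read_text())
        checks["schema_ok"] = isinstance(cfg.get("max_retries"), int)
    except (FileNotFoundError, json.JSONDecodeError):
        checks["schema_ok"] = False
    return checks
```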
5. Invocation, trace, and failure taxonomy are where agent eval really starts to mature
The invocation problem raised in the article is, in my opinion, the most underestimated layer in all of skill eval.
Many teams naturally focus only on the final task success rate and ignore a more critical question: **whether that success was actually brought about by the skill.**
There are at least three classic misjudgments: success without invocation, invocation without contribution, and invocation with collateral damage.
If you only look at pass/fail, all of these get misclassified as "the skill took effect".
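One way to separate these cases, sketched under the assumption that the harness records which skills a trace invoked and can re-run the task with the skill removed as a rough contribution baseline. The trace fields and helper names here are placeholders, not a real tracing API.

```python
def classify_invocation(trace, skill_name: str, task_passed: bool,
                        baseline_passed: bool) -> str:
    """Separate "the task passed" from "the skill is why it passed".

    baseline_passed: outcome of the same task with the skill removed,
    used here as a crude proxy for contribution.
    """
    invoked = skill_name in trace.invoked_skills          # assumed trace field
    if task_passed and not invoked:
        return "success_without_invocation"
    if task_passed and invoked and baseline_passed:
        return "invocation_without_contribution"
    if invoked and trace.errors_after(skill_name):        # assumed helper
        return "invocation_with_collateral_damage"
    return "skill_contributed" if task_passed else "failure"
```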
Similarly, the value of a trace is not icing on the cake; it is what lets you take a "failure" apart. Without a trace, you only know that the task was not completed, that it was slow, that it was expensive, that the output was wrong.
But you don't know how it went astray: did it miss the skill, use the wrong skill, read too many irrelevant files, get buried under oversized tool output, or did the environment trip it up?
Therefore, a mature agent eval needs three levels of capability at the same time: look at the result, look at the behavior, and look at what type of problem a failure belongs to.
I think this is the watershed between a "scoring system" and a "diagnostic system".
6. A mature eval harness should ultimately optimize four things at the same time
When many teams do eval, they ultimately track only one thing: a quality score. In agent scenarios this is far from enough.
I prefer to break down the goals into four categories:
Quality: whether the task was done correctly.
Efficiency: how many turns, how much time, how many tool calls, and how many tokens were used.
Reliability: whether it is stable under the same conditions and whether it is fragile to environmental change.
Interpretability: whether failures can be explained and successes can be attributed.
If any of these four dimensions is missing, the system is easily misled: look only at quality and not efficiency, and the system may be correct but absurdly expensive; look only at efficiency and not quality, and it may be fast but routinely bypass key steps; look at quality and efficiency but not reliability, and a system that demos well may be unstable in production; look at the first three but not interpretability, and the team still doesn't know what to change.
So I increasingly feel that a truly mature eval harness should not just hand out scores; it should generate a structured judgment: what the result was, what it cost, how stable it is, and why.
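As a small illustration of what such a structured judgment might look like as data rather than a single score (the fields are my own sketch, not a prescribed schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunJudgment:
    """Hypothetical structured judgment for one evaluated run."""
    # Quality: did the task actually get done correctly?
    quality: float                          # e.g. fraction of artifact checks passed
    # Efficiency: what did the result cost?
    turns: int
    tool_calls: int
    tokens: int
    wall_time_s: float
    # Reliability: is it stable across repeated runs of the same condition?
    pass_rate_over_n_runs: float
    # Interpretability: can failure be attributed to something actionable?
    failure_category: Optional[str] = None  # e.g. "skill_not_triggered", "env_residue"
    notes: list[str] = field(default_factory=list)
```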
7. When does output-only eval still hold?
To keep this article from reading as the extreme claim that "every evaluation must be heavyweight", I want to add a layer of boundary awareness.
Output-only eval is still viable when a task has these characteristics: it is extremely short, the path is shallow, the intermediate process contains few high-value differences, the output itself is the main product value, the cost of failure is low, and there is no complex coupling of skill injection, tool chains, and environment.
For example, for some lightweight rewriting, format conversion, and single-step generation tasks, full trajectory evaluation is possible, but the cost would likely be too high and the benefit too low.
So the truly mature judgment is not "never use output-only eval", but: **only when process information would significantly change your interpretation of the results does process evaluation need to be promoted to a first-class citizen.**
8. Split eval into three layers so the organization is not dragged astray by a single set of metrics
I increasingly feel that much of the pain teams experience with eval comes from trying to use a single evaluation layer to satisfy three completely different demands at once: external reporting, internal diagnosis, and pre-launch gatekeeping.
These three things really should not look the same. I would rather the team split eval into three layers:
Layer 1: the summary layer, for management, the product side, and cross-team communication. Its indicators should be few, stable, and readable, while clearly acknowledging that this is only a summary.
Layer 2: the diagnostic layer, for the engineering and eval teams' own use. Its focus is traces, failure slices, invocation, artifact quality, and environment slices.
Layer 3: the gate layer, used for pre-launch or version-promotion decisions. It does not try to explain the whole world; it pursues the most critical quality and risk thresholds.
Once these three layers are separated, communication confusion in many organizations drops immediately, because you no longer try to use one score to do three completely different jobs: reporting, diagnosis, and decision-making.
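A minimal sketch of the gate layer, consuming per-run judgments produced by the diagnostic layer; the thresholds and field names are placeholders for illustration, not recommended values.

```python
def release_gate(judgments: list[dict]) -> bool:
    """Promote a version only if it clears the critical quality and stability
    thresholds; detailed explanation stays in the diagnostic layer."""
    avg_quality = sum(j["quality"] for j in judgments) / len(judgments)
    worst_stability = min(j["pass_rate_over_n_runs"] for j in judgments)
    attributed_failures = [j for j in judgments if j.get("failure_category")]
    # Placeholder thresholds: tune per product and risk tolerance.
    return avg_quality >= 0.90 and worst_stability >= 0.80 and not attributed_failures
```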
Conclusion: a truly mature eval does not just tell the team where the red light is; it tells them what kind of system problem that red light represents.
If I had to leave this article with one closing sentence, it would be:
The sign of a truly mature evaluation system is not whether it keeps reporting red lights, but whether it lets the team know whether a red light is a capability problem, a triggering problem, an execution problem, or an environment problem.
Only then does eval truly change from a "scoring device" into a "system-understanding device".
Who should read this
This article is suitable for the following types of readers:
- Teams building an agent eval platform
- AI engineers running skills/prompts/workflow A/B tests
- Technical leads who want to turn benchmarks from a "reporting tool" into a "diagnostic tool"
- Integrators working with LangSmith, OpenTelemetry, or internal tracing systems
- Anyone who has been troubled by "the score went up but the system is no clearer"