Path
Eval Harness
The design and practice of agent evaluation systems, covering process measurement, artifact-based scoring, trace analysis, and infrastructure noise control.
-
1. A truly mature eval harness does not stop at the final answer
If an eval harness can only tell you whether a task succeeded or failed, but cannot explain which capabilities the agent called, in what environment it ran, why it failed, and why it succeeded, then what it delivers is not systematic judgment but a scorecard. This article builds on LangChain's discussion of skills eval and extends my full understanding of artifact-based scoring, invocation metrics, trace design, workflow eval, and evaluation methodology.
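As a rough illustration, a record like the following (a minimal sketch; the field and function names are my own, not LangChain's API) captures what a scorecard-only harness throws away: the invocations, the environment, the artifacts, and the failure reason.

```python
from dataclasses import dataclass, field

@dataclass
class ToolInvocation:
    name: str          # capability the agent called
    succeeded: bool    # whether the call itself succeeded
    latency_ms: float  # per-call latency: an invocation metric

@dataclass
class EvalRecord:
    task_id: str
    passed: bool  # the classic scorecard bit
    invocations: list[ToolInvocation] = field(default_factory=list)
    environment: dict[str, str] = field(default_factory=dict)  # e.g. sandbox image, CPU limit
    artifacts: dict[str, str] = field(default_factory=dict)    # files/outputs scored directly
    failure_reason: str | None = None

def score(record: EvalRecord, expected_tools: set[str]) -> dict[str, float]:
    """Score artifacts and invocations, not just the pass/fail flag."""
    called = {inv.name for inv in record.invocations}
    return {
        "task_pass": float(record.passed),
        # Invocation metric: did the agent call the capabilities the task expects?
        "tool_recall": len(called & expected_tools) / max(len(expected_tools), 1),
        # Artifact-based score: did the run actually produce the required outputs?
        "artifact_coverage": float(bool(record.artifacts)),
    }
```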
-
2. The most misleading thing about agent benchmarks is not the model score, but the infrastructure noise
In agentic coding evals, the model is not the only variable. Resource headroom, kill semantics, concurrency pressure, network conditions, and sandbox behavior can all change task outcomes. If these conditions are not disclosed, small margins on a leaderboard are often less telling than they seem. This article builds on Anthropic's analysis of infrastructure noise and extends my full understanding of agent benchmark interpretability, disclosure discipline, repeated trials, and a system-level view of evaluation.
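A minimal sketch of that disclosure-and-repetition discipline (the runner and infra fields are hypothetical, not Anthropic's tooling): repeat each task under recorded infrastructure conditions and report run-to-run noise alongside the pass rate, so a leaderboard margin can be judged against it.

```python
from typing import Callable

def run_with_disclosure(
    run_task: Callable[[], bool],  # hypothetical runner: returns True on task success
    infra: dict[str, str],         # disclosed conditions, e.g. {"cpu_limit": "2", "kill_signal": "SIGKILL"}
    trials: int = 10,
) -> dict:
    """Repeat a task under recorded infrastructure conditions and report noise."""
    results = [run_task() for _ in range(trials)]
    pass_rate = sum(results) / trials
    # Bernoulli standard error: a leaderboard margin smaller than this is
    # indistinguishable from run-to-run (often infrastructure-induced) noise.
    stderr = (pass_rate * (1 - pass_rate) / trials) ** 0.5
    return {"infra": infra, "pass_rate": pass_rate, "stderr": stderr, "trials": trials}
```

Two runs of the same model under different kill semantics can then be compared as (pass_rate, stderr, infra) triples rather than bare scores.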