Hualin Luan Cloud Native · Quant Trading · AI Engineering

Tag

Agents

English articles and guides tagged Agents.

Eval Harness 3/25/2026

A truly mature eval harness will not focus on the answer alone

If an eval harness can only tell you whether a task succeeded or failed, but cannot explain whether the agent called the correct capabilities, in what environment it executed, why it failed, and why it succeeded, then what it delivers is not a systematic judgment but merely a scorecard. This article builds on LangChain's discussion of skills eval and extends my complete understanding of artifact-based scoring, invocation metrics, trace design, workflow eval, and the anatomy of evaluation systems.

Evals Agent Skills LangSmith Tracing Agents
Eval Harness 3/25/2026

The most misleading thing about agent benchmarks is not the model scores, but the infrastructure noise.

In agentic coding evals, the model is not the only variable. Resource headroom, kill semantics, concurrency pressure, network conditions, and sandbox behavior can all change task results. When these conditions are not disclosed, small margins on the leaderboard are often less telling than they seem. This article builds on Anthropic's analysis of infrastructure noise and extends my complete understanding of agent benchmark interpretability, disclosure discipline, repeated experiments, and system-level evaluation perspectives.

Evals Infrastructure Benchmark Agents Anthropic
Agent Harness 3/25/2026

What long-running task agents really lack is not intelligence, but handover, recovery, and acceptance capabilities.

The failure of long-running task agents often does not stem from the model's inability to think, but from the system's failure to treat handover, recovery, verification, and continuation as first-class citizens. This article builds on Anthropic's discussion of the long-running agent harness and extends my complete understanding of cross-session execution, state externalization, feature contracts, smoke tests, browser verification, and multi-round execution structure. It also explains why a truly usable agent does not run in one long stretch, but can pick up and continue round after round.

Agents Long Running Agents Harness Anthropic Verification
MCP Runtime 3/25/2026

What MCP changes is not tool access, but the cost structure of agents.

The real significance of MCP is not merely unifying tool access, but moving a large number of intermediate processes that the runtime should handle out of the expensive LLM loop. What it changes is not how many tools can be connected, but how the agent uses context, code execution, and runtime control flow. This article builds on Anthropic's discussion of code execution with MCP and extends my complete understanding of direct tool-calling, progressive disclosure, runtime economics, and executable skills.

MCP Code Execution Context Engineering Agents Anthropic
Agent Harness 3/25/2026

The agent harness is not a supporting player, but the most underrated battleground of AI engineering in 2026

What really determines an agent's ceiling is often not the model itself, but the harness organized around it. This article builds on LangChain's breakdown of the agent harness and extends my complete understanding of file systems, code execution, context management, closed verification loops, and long-running task endurance. It also explains why the focus of AI engineering competition in 2026 is shifting from model capabilities to working-system design.

Agents Harness Context Engineering AI Engineering LangChain