Tag

Anthropic

English articles and guides tagged Anthropic.

Eval Harness 3/25/2026

The most misleading part of an agent benchmark is not the model score, but the infrastructure noise.

In agentic coding evals, the model is not the only variable. Resource headroom, kill semantics, concurrency pressure, network conditions, and sandbox behavior can all change task results. If these conditions are not disclosed, small margins on a leaderboard often mean less than they appear to. This article builds on Anthropic's analysis of infrastructure noise and lays out my full view of agent benchmark interpretability, disclosure discipline, repeated experiments, and system-level evaluation.

Evals Infrastructure Benchmark Agents Anthropic

Agent Harness 3/25/2026

What long-running agents really lack is not intelligence, but handover, recovery, and acceptance capabilities.

Long-running agents often fail not because the model cannot reason, but because the system does not treat handover, recovery, verification, and continuation as first-class concerns. This article builds on Anthropic's discussion of long-running agent harnesses and lays out my full view of cross-session execution, state externalization, feature contracts, smoke tests, browser verification, and multi-round execution structure. It also explains why a truly usable agent is not one that runs for a long time in a single session, but one that can pick the work back up round after round.

Agents Long Running Agents Harness Anthropic Verification

MCP Runtime 3/25/2026

What MCP changes is not tool access, but the cost structure of agents.

The real significance of MCP is not just that it unifies tool access, but that it moves much of the intermediate processing the runtime should handle out of the expensive LLM loop. What it changes is not how many tools can be connected, but how the agent uses context, code execution, and runtime control flow. This article builds on Anthropic's discussion of code execution with MCP and lays out my full view of direct tool calling, progressive disclosure, runtime economics, and executable skills.

MCP Code Execution Context Engineering Agents Anthropic