Hualin Luan Cloud Native · Quant Trading · AI Engineering

Agent Harness is not a supporting role, but the most underrated main battleground of AI engineering in 2026

What really determines an agent's ceiling is often not the model itself, but the harness organized around it. Building on LangChain's anatomy of the agent harness, this article lays out my end-to-end understanding of file systems, code execution, context management, verification loops, and long-horizon task endurance, and explains why the focus of AI engineering competition in 2026 is shifting from "model capability" to "working-system design".

Meta

Published

3/25/2026

Category

Interpretation

Reading Time

22 min read

Original reference: Vivek Trivedy, “The Anatomy of an Agent Harness”. This article is not a translation, but an original interpretation built on the source's point of view.


If you have been reading articles about agents over the past year, you will have noticed a subtle but unmistakable shift: everyone is still talking about models, but quietly they have begun to turn toward systems.

Last year, the discussions centered on “which model is best suited for agents”, “which framework is more convenient”, and “is tool use stable?”. This year, almost everyone who has actually shipped a round of production deployments runs into the same fact: the same model, placed into different working systems, behaves like a completely different thing.

This is also why LangChain’s “The Anatomy of an Agent Harness” deserves a careful read. It is not merely introducing a concept; it is helping the whole industry recalibrate its perspective: what truly determines an agent’s actual capability is never just the model itself, but the often-overlooked layer around the model that actually controls success or failure, the operating structure: the harness.

To put it more directly: over the past year, many teams thought they were optimizing agents; in fact they were just swapping models, tweaking prompts, and adding tools. The real work is deciding how the model is organized, constrained, sustained, verified, and put to work in a real environment.

I feel increasingly strongly that the most underrated battleground of AI engineering in 2026 is not “model iteration speed” but “harness design maturity”.

1. Why we can no longer talk about agents by talking only about models

Models are of course important. Without strong enough reasoning, instruction following, and tool-calling stability, an agent can hardly have a decent ceiling. But once you actually get into a project, everyone soon discovers an uncomfortable reality: the strength of the model cannot explain most of the real-world performance differences between agents.

To cite a few phenomena that engineering teams know all too well:

- the same model performs beautifully in one repository but keeps getting lost in another;
- the first session is sharp, but the task drifts as time goes on;
- after a dozen or more tools are connected, the agent actually gets dumber;
- a single-round task is done beautifully, but across multiple rounds it “looks like a lot of progress” while nothing actually gets finished;
- it is fearless in the demo environment, but in production it keeps exposing cost, permission, recovery, and verification problems.

These problems can occasionally be alleviated by switching to a stronger model, but they are rarely cured, because most of the symptoms belong not to the pure reasoning layer but to the working-system layer.

Whether the model can think is one question; whether the model has been placed in an environment where it can keep working is another. Many teams today still instinctively understand an agent as “a chat model that calls tools”, which lags well behind reality.

An agent in reality is more like a carefully encapsulated execution system. It has a memory surface, a file system, a tool discovery mechanism, a code execution environment, permissions and isolation, verification and feedback loops, context compression and recovery mechanisms, multi-round handover capabilities, and possibly even multi-agent division of labor and orchestration.

Without these, no matter how powerful the model is, it remains like a smart person without a workbench, a notebook, a dashboard, or a lab. You can get it to answer a few questions, but it is hard to get it to deliver a steady stream of real work.

2. What a harness is: not a component, but a working-system layer

The most valuable thing about LangChain’s article is that it pulls the word harness out of its vague state.

When many people hear harness, they mistakenly think it is a controller, a wrapper, or a small module inside an agent framework. Understood only that way, it is badly underestimated.

More precisely, a harness is not a point but a layer: the operational structure wrapped around the model. Every capability that lives outside the model itself yet is inseparable from the agent should be viewed from the harness perspective.

From an engineering perspective, harness includes at least three levels of capability support.

The first layer is behavior and interaction control: system prompts and rule constraints define the behavior boundaries of the agent, tools and MCP definitions provide interfaces for interacting with the external world, and hooks and middleware allow the system to insert custom logic at key nodes.

The second layer is the execution and storage environment: the file system provides a persistence surface, bash and code execution give the agent real action power, the sandbox and permission control delineate security boundaries, and the browser and UI-level verification allow the agent to perceive and verify the real interface state.

The third layer is cognition and memory management: memory and retrieval solve information recall, compaction and tool-output offloading control context quality, progressive disclosure determines how capabilities are exposed on demand, and logging, tracing, metrics, and verification loops make the entire system observable, debuggable, and optimizable.

These three layers together form a complete harness.
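To make the layering concrete, here is a minimal sketch that expresses the three layers as plain configuration. Every name and field is my own illustrative assumption, not an API from the LangChain article:

```python
# A minimal sketch of the three harness layers as plain Python structures.
# Every name and field here is an illustrative assumption, not a real API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class BehaviorLayer:
    """Layer 1: behavior and interaction control."""
    system_prompt: str
    tools: dict[str, Callable] = field(default_factory=dict)  # incl. MCP-backed tools
    hooks: list[Callable] = field(default_factory=list)       # middleware at key nodes

@dataclass
class ExecutionLayer:
    """Layer 2: execution and storage environment."""
    workdir: str                   # file system surface for persistence
    allow_network: bool = False    # sandbox boundary
    timeout_s: int = 60            # resource limit for bash/code execution

@dataclass
class CognitionLayer:
    """Layer 3: cognition and memory management."""
    max_context_tokens: int = 100_000
    compaction_threshold: float = 0.8   # compact when context is ~80% full
    offload_dir: str = "artifacts/"     # where oversized tool outputs land

@dataclass
class Harness:
    behavior: BehaviorLayer
    execution: ExecutionLayer
    cognition: CognitionLayer
```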

Seeing this, many people will realize one thing: this is actually very close to software system design, not just prompt design.

This is why I increasingly dislike the formulation “agent engineering = prompt writing”. That framing only fit the demo era, not now. Today’s agent engineering is more like placing a reasoning core into a runnable, recoverable, scalable, and verifiable working environment.

**In other words, the core model is the brain, and the harness is the nervous system, the memory, the hands and feet, the workbench, and the safety guardrails.**

A brain alone can, of course, reason about problems; it cannot do engineering.

3. Why the file system is becoming core agent infrastructure again

The point I agree with most in this article is how highly it elevates the file system. In agent design, many teams prioritize model routing, the tool catalog, planner prompts, and a memory database. They all look advanced, but once long-running tasks enter the picture, the humble file system often becomes the fulcrum of the whole system again.

Why? Because the file system naturally solves a problem that large models have not been able to solve: **turning fragile, temporary, and expensive context into stable, readable, and shareable external facts.**

1. File systems are the cheapest long-term memory

Model context is expensive and perishable. As a session grows longer, old and new information begin to overwrite each other; with too many intermediate steps, the truly important constraints are easily buried. You can of course do memory injection, vector recall, and summary stitching, but in many cases a plan file, a progress log, and a stage-artifact document are more stable than any fancy memory pipeline.

The reason is simple: files are readable, writable, versionable, and reviewable. They are naturally suited to handovers, and they do not suffer attention dilution as the conversation scrolls on.

An agent that writes its task process back to files and one that leaves everything in context often differ in stability by an order of magnitude.

2. The file system allows multiple rounds of tasks to truly “have a place to land”

Many agent systems appear to be working but are really just generating text continuously. They do a lot of reasoning, produce a lot of plans, and output a lot of intermediate analysis, but none of it lands on a workbench, so the next round cannot pick it up, and neither can a human.

This is the value of a file system: it gives tasks a real external surface. Research results land in findings.md, plans in task_plan.md, progress in progress.md; code is written directly into the repository, run results go into logs, failure causes become error records, and the next round of the agent can simply read them.

This is why I feel more and more strongly: **an agent’s “memory” should not be understood mainly as something inside the model’s head, but as an external working-memory layer with the file system at its core.**
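As a concrete illustration, here is a minimal sketch of that write-back-and-handover loop, using the file names mentioned above (the function names are my own assumptions):

```python
# A minimal sketch of writing task state back to files so the next round
# (or a human) can pick it up. File names follow the examples in the text.
from pathlib import Path
from datetime import datetime, timezone

def write_back(workdir: str, plan: str, findings: str, progress_note: str) -> None:
    root = Path(workdir)
    root.mkdir(parents=True, exist_ok=True)
    (root / "task_plan.md").write_text(plan, encoding="utf-8")
    (root / "findings.md").write_text(findings, encoding="utf-8")
    # progress.md is append-only: a timeline the next round reads first.
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with (root / "progress.md").open("a", encoding="utf-8") as f:
        f.write(f"- {stamp} {progress_note}\n")

def read_handover(workdir: str) -> str:
    """What a fresh session reads before doing anything else."""
    root = Path(workdir)
    parts = []
    for name in ("task_plan.md", "progress.md", "findings.md"):
        p = root / name
        if p.exists():
            parts.append(f"## {name}\n{p.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```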

3. Git upgrades file systems into recovery infrastructure

The file system itself is already very important. Once combined with git, it is immediately upgraded from “a place where things are stored” to a “recovery and audit system”.

This is absolutely critical for an agent, because an agent naturally makes mistakes in its attempts, and may even perform a series of wrong actions while you are not watching. Without git, all you can do is hope it doesn’t break. With git, you at least have rollback capability, visual diffs, staged delivery, a timeline across multi-round handovers, and an auditable record of failed attempts.
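A minimal sketch of what this looks like in practice, assuming the working directory is already a git repository and `git` is on PATH (helper names are mine):

```python
# A minimal sketch of using git as a recovery device: commit after every
# agent step so any bad run can be diffed and rolled back.
import subprocess

def _git(workdir: str, *args: str) -> str:
    return subprocess.run(
        ["git", "-C", workdir, *args],
        check=True, capture_output=True, text=True,
    ).stdout

def checkpoint(workdir: str, step_description: str) -> str:
    """Commit everything the agent touched in this step; return the commit hash."""
    _git(workdir, "add", "-A")
    # --allow-empty keeps the timeline intact even for no-op steps.
    _git(workdir, "commit", "--allow-empty", "-m", f"agent: {step_description}")
    return _git(workdir, "rev-parse", "HEAD").strip()

def rollback(workdir: str, commit: str) -> None:
    """Throw away everything after a known-good checkpoint."""
    _git(workdir, "reset", "--hard", commit)
```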

Many people see git as a collaboration tool for developers, but in agent scenarios it is more like a runtime safety device.

So the most effective agent harness today is often not the one that is “better at thinking”, but the one that is better at writing back files, leaving history, and letting the next round continue.

4. Why code execution is not an add-on capability, but the agent’s real source of agency

If the file system solves “memory and externalization”, then code execution solves another, more fundamental problem: **how the agent acquires a general capacity to act.**

Many teams like to design lots of specialized tools from the start, and that certainly helps: purpose-built tools make permission boundaries clearer, actions more controllable, and calls easier to monitor. But as soon as tasks get even slightly more complex, you run into cases the tool design cannot enumerate.

Tasks in real environments change too quickly: a data format changes at the last minute, extra cleaning is needed, the outputs of two tools must be combined, a short script must be written to test a hypothesis, files must be processed in batches, and aggregation, filtering, polling, and retries must be improvised.

If everything had to be prefabricated as a tool, the system would quickly become a fragile museum of capabilities. Giving the agent a surface on which to run code safely, by contrast, lets it build its own small tools at the execution site.

This is why I strongly agree with the article’s judgment: **bash/code execution is not an optional add-on, but the key step that takes the agent from “able to call functions” to “able to solve problems”.**

1. Fixed tools provide a capability list, while code execution provides a capability amplifier.

Fixed tools are like a menu; code execution is like a kitchen. No matter how rich the menu, you will still run into requests that are not on it; only with a kitchen can you truly improvise on the spot.

This does not mean the agent should write code for everything. It means that once the scene gets complex, the path is not fixed, and results need further processing, code execution makes the whole system far more flexible.

2. A truly mature harness does not hand everything to the model; it moves programmable work back into programs.

A lot of work that looks like it “must be done by the model” should never have involved the model repeatedly: batch-filtering data, converting formats, merging results, polling and waiting, retrying failures, compressing output, redacting intermediate values. If each of these is orchestrated through the model turn by turn, it is not only expensive but fragile.

A more reasonable approach is to let the model decide what to do, let the code environment do the mechanical part, and then send the truly valuable summary back to the model.

This is not about weakening the model; it is about returning the model to the position that suits it best: **making high-value judgments, not hauling data back and forth.**
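As a small illustration of that division of labor, here is a sketch in which the code environment does the mechanical filtering and counting, and only a compact summary re-enters the model's context (the record shape and field names are assumptions):

```python
# A minimal sketch of offloading mechanical work to code: filter and
# aggregate records locally, and return only a short summary for the model.
import json
from collections import Counter

def summarize_records(raw_records: list[dict]) -> str:
    """Mechanical work: filter, count, aggregate. No model tokens spent."""
    ok = [r for r in raw_records if r.get("status") == "ok"]
    errors = Counter(
        r.get("error", "unknown") for r in raw_records if r.get("status") != "ok"
    )
    summary = {
        "total": len(raw_records),
        "ok": len(ok),
        "top_errors": errors.most_common(3),
    }
    # This short JSON string is what re-enters the context, not the records.
    return json.dumps(summary)
```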

5. Why the sandbox turned from a “security appendage” into a core part of the harness

Once you accept that agents should have a greater capacity to act, another question inevitably follows: how do you manage the boundaries?

When many people discuss sandboxes, there is always a whiff of “an extra requirement from the security team”, as if they existed mainly for compliance. But in agent scenarios the sandbox is not just a security facility; it is itself part of product capability and engineering stability.

Why? Because without a sandbox, code execution capability quickly spins out of control. You have to answer: where does execution happen? Which directories can it access? Can it reach the network? Can it install dependencies? How long may it keep running? What are the resource limits? How is cleanup done after an error? Can it be audited?

These look like operations problems, but they directly affect the agent’s usability. An agent without clear boundaries is usually not “more flexible” but harder to use, harder to trust, and harder to recover.
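One way to answer those questions is to make the boundaries explicit, declarative configuration rather than implicit behavior. A minimal sketch, with field names as illustrative assumptions and the actual enforcement mechanism (container, seccomp, and so on) out of scope:

```python
# A minimal sketch of a sandbox policy expressed as explicit configuration.
# Field names are assumptions for illustration; enforcement is out of scope.
from dataclasses import dataclass

@dataclass(frozen=True)
class SandboxPolicy:
    workdir: str                              # where execution happens
    readable_paths: tuple[str, ...] = ()      # which directories are visible
    writable_paths: tuple[str, ...] = ()
    allow_network: bool = False               # can it reach the internet?
    allowed_packages: tuple[str, ...] = ()    # which dependencies it may install
    max_runtime_s: int = 120                  # how long it may keep running
    max_memory_mb: int = 512                  # resource ceiling
    cleanup_on_error: bool = True             # reset the workdir after failures
    audit_log: str = "sandbox_audit.jsonl"    # every action is recorded here
```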

This is why I increasingly treat the sandbox as a first-class citizen of the harness. Like code execution, the file system, and verification, it is not an independent component but another facet of the same runtime design.

To put it bluntly: **an agent that can act but has no boundaries is not a capability; it is an accident waiting to happen.**

6. Context rot is the real invisible killer in many agent systems

If you asked me what was the most underrated agent problem of the past year, I would probably answer: context rot.

Many failures look, on the surface, like reasoning errors, planning errors, or tool-selection errors. Dig deeper, though, and you will often find that the real problem is not that the agent is incompetent, but that the agent’s working context has broken down.

Context ages. It is not a static container but a dynamic accumulation of increasingly chaotic layers. As a task grows longer, the context begins to deteriorate: old goals linger in the conversation like ghosts, and new constraints try to override them but never fully do; tool outputs pile up layer upon layer, and failed paths in the middle are never cleaned up; the truly important information drowns in a torrent of minor detail, and the model starts leaning on locally salient information while gradually losing sight of the global goal.

At this point you see a very typical illusion: the agent is still “working hard”, and may even keep producing plausible-looking output, yet it drifts further and further from the goal.

I think the compaction, tool-output offloading, and skills/progressive disclosure discussed in the LangChain article are all essentially solving the same problem: **making the context usable again.**

1. Compaction is not a summary, but an attention reconstruction

Many people understand compaction as “compressing the dialogue”, which is far too shallow. Truly effective compaction does not shorten; it reorganizes the focus of attention.

Good compaction has to answer: What is the current real goal? Which attempts have failed? Which constraints must be preserved? Which details should be moved out of context? Where does it make the most sense for the next round to pick up?

If you merely shorten the original text, that is not compaction; it is just text slimming.
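A minimal sketch of compaction framed this way: the replacement context is rebuilt around exactly those questions rather than produced by shortening the transcript. The prompt wording and the `llm_call` interface are my own assumptions:

```python
# A minimal sketch of compaction as attention reconstruction, not summary.
# The prompt wording is an assumption, not taken from the article.
COMPACTION_PROMPT = """Rebuild the working context for an ongoing task.
From the transcript below, answer ONLY:
1. What is the current real goal?
2. Which attempts failed, and why (one line each)?
3. Which constraints must be preserved verbatim?
4. Which files/artifacts hold the details removed from context?
5. What is the single next step?

Transcript:
{transcript}
"""

def compact(transcript: str, llm_call) -> str:
    """llm_call is any `str -> str` model client; the result replaces
    the old transcript in the agent's context."""
    return llm_call(COMPACTION_PROMPT.format(transcript=transcript))
```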

2. Tool-output offloading is the new “log design” in the agent era

Injecting large tool outputs straight back into the model is one of the biggest sources of hidden cost in systems today. The result is not that the model cannot understand them, but that it is forced to spend attention on content that should never have occupied the context at all.

So more and more mature systems do the same thing: put large results on disk, in logs, and in files, and bring only the necessary summaries back into context.

The idea behind this is very traditional, even a bit like old-school software system design: keep only high-value information on the hot path, and keep the raw data in external systems. The only difference is that the “hot path” is now the prompt context.
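A minimal sketch of that pattern: anything over a size threshold is written to disk, and only a pointer plus a short head returns to context (the threshold and naming are illustrative assumptions):

```python
# A minimal sketch of tool-output offloading: the full result goes to disk,
# and only a pointer plus a short head re-enters the model's context.
from pathlib import Path
import hashlib

OFFLOAD_DIR = Path("artifacts")
MAX_INLINE_CHARS = 2_000   # anything bigger goes to disk

def offload_if_large(tool_name: str, output: str) -> str:
    """Return what actually goes back into the model's context."""
    if len(output) <= MAX_INLINE_CHARS:
        return output
    OFFLOAD_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256(output.encode()).hexdigest()[:12]
    path = OFFLOAD_DIR / f"{tool_name}_{digest}.txt"
    path.write_text(output, encoding="utf-8")
    head = output[:500]
    return (f"[output of {tool_name}: {len(output)} chars, "
            f"saved to {path}]\nFirst 500 chars:\n{head}")
```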

3. Progressive disclosure solves “over-injection of capability”

Many teams initially like to stuff all capability descriptions, all rules, and all tool definitions into the model at once, believing this is “more complete”. The reality is exactly the opposite: the more you feed in up front, the easier it is for the model to lose focus at execution time.

The value of progressive disclosure is that it acknowledges a basic fact: **the context is not an encyclopedia; it is working memory.**

The key to working memory is not to be as comprehensive as possible, but to be as relevant as possible.
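A minimal sketch of progressive disclosure applied to tools: the context always carries a cheap one-line index, and full schemas are loaded only when the model asks for a specific tool (the structure and names are assumptions):

```python
# A minimal sketch of progressive disclosure for tools: cheap index always
# in context, expensive full schemas loaded on demand. Names are illustrative.
TOOL_INDEX = {
    # name -> one-line description (cheap, always in context)
    "search_logs": "search service logs by query and time range",
    "run_sql": "execute a read-only SQL query against the analytics DB",
    "render_chart": "render a chart from tabular data",
}

TOOL_SCHEMAS = {
    # name -> full definition (expensive, disclosed only when needed)
    "search_logs": {"params": {"query": "str", "since": "iso8601"}},
    "run_sql": {"params": {"query": "str", "timeout_s": "int"}},
    "render_chart": {"params": {"data": "list[dict]", "kind": "str"}},
}

def context_tool_listing() -> str:
    """What the model always sees: one relevant line per tool."""
    return "\n".join(f"- {n}: {d}" for n, d in TOOL_INDEX.items())

def disclose(tool_name: str) -> dict:
    """Called only when the model decides it needs a specific tool."""
    return TOOL_SCHEMAS[tool_name]
```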

7. Real agent endurance comes from a recoverable multi-round structure, not single-round heroics

One thing about LangChain’s article that particularly resonates with me is that it tacitly accepts a reality: long tasks are never solved by a single act of reasoning, but by a process spanning multiple rounds, multiple stages, multiple artifacts, and multiple failure recoveries.

This sounds like common sense, yet many agent products today are still betting on one thing: **hoping that a session lasts long enough, the model is smart enough, and the context is large enough for a complex task to be completed in one continuous run.**

I increasingly feel this path will keep disappointing teams. Complex tasks are not “longer question-and-answer”; they are more like project execution. And since they are like project execution, they require phase division, external artifacts, failure recovery, work handover, goal reaffirmation, result verification, and boundary-condition updates.

In other words, real endurance comes from multi-round structural design, not single-round magic.

A mature harness should at least make these things natural: what to do in the current round, how to connect to the next, what to read first when picking up, how to roll back after failure, which state must be written back externally, and under what conditions the task may be marked complete.
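Tying the earlier sketches together, here is a minimal recoverable round loop; `run_round` and `verify` are assumed callables, and the `result` fields are illustrative:

```python
# A minimal sketch of a recoverable multi-round loop, reusing write_back,
# read_handover, checkpoint, and rollback from the earlier sketches.
# run_round and verify are assumed callables; result fields are illustrative.
def execute_task(workdir: str, run_round, verify, max_rounds: int = 20) -> bool:
    for i in range(max_rounds):
        good = checkpoint(workdir, f"before round {i}")   # known-good state
        state = read_handover(workdir)                    # what to read first
        result = run_round(state)                         # one bounded round of work
        write_back(workdir, result.plan, result.findings, result.note)
        if not verify(workdir):          # completion is a hard check,
            rollback(workdir, good)      # not the model's own claim
            continue
        checkpoint(workdir, f"round {i}: {result.note}")  # handover to next round
        if result.done:                  # only now may the task be marked complete
            return True
    return False
```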

This kind of design does not look “cool”, but it determines whether the agent is a demo or a worker.

8. Why I think the strongest teams of the future will not separate models from harnesses

Many companies still show an obvious misalignment in their division of labor: the model team watches benchmarks, the application team stacks prompts, and the platform team patches in tools and permissions. Everyone feels they are working on part of the agent, but nobody truly owns “overall agent system performance”.

This will bring a very bad result: everyone is optimizing local parts, but no one is responsible for the overall stability.

The real value of the harness perspective is that it pulls these scattered issues back together. It reminds us: the prompt is not an independent variable, and neither are the tool catalog, memory, the sandbox, or the eval.

These things together form a working system. You can’t just max out one of them and expect the agent to naturally become stronger.

I increasingly believe the competitiveness of the strongest agent teams of the future will come not only from “getting the latest model”, but from how well they combine the following:

- whether they know which work should stay with the model and which should be handed to the system;
- whether they have a stable external memory surface;
- whether verification is a hard constraint rather than a courtesy;
- whether the agent can keep its sense of direction through long tasks;
- whether it can recover after failure instead of starting over;
- whether they can control context debt instead of endlessly piling up context.

If these are done well, the system can be very strong even without the newest model. Conversely, put a top model into a messy harness and you end up with something “occasionally stunning, often out of control”.

9. The most worthwhile takeaway from this article is not the terminology, but a way of judging problems

After reading “The Anatomy of an Agent Harness”, I think the most useful outcome is not remembering which components a harness contains, but learning to look at agent problems differently.

In the future, when an agent system behaves unstably, your first reaction should not just be “Is the model not strong enough? Is the prompt not good enough? Is the tool description not detailed enough?”. You should first ask:

- Does it have a clear external working memory?
- Is the context polluted by large results?
- Do actions rely too heavily on the model orchestrating step by step?
- Is there a safe, recoverable code-execution environment?
- Does the task lack staged handover points?
- Is the completion criterion actually bound to verification?
- Is the current failure a reasoning problem, or did the harness doom it from the start?

This set of questions matters because it pulls you out of the inertia of “keep fiddling with model parameters” and moves you to a position with real leverage.

10. My judgment on agent projects in 2026: the focus will keep shifting from models to working systems

I don’t think models are unimportant. On the contrary, models will only keep getting stronger, and reasoning capability, long context, and tool use will keep improving. But precisely because models keep getting stronger, the importance of the harness will be exposed even further.

The reason is simple: when base-model capabilities converge, differences in system design are magnified. Just as when programmers are similarly skilled, engineering organization ability rises rapidly in importance.

So my strong judgment for the next one or two years is:

**Competition in AI engineering will look more and more like “who is better at building a working system” rather than “who shouts model slogans louder”.**

What can really advance an agent from “looking smart” to “worthy of long-term trust” is often not a stunning prompt, a new framework, or a lead in a certain benchmark, but these seemingly unsexy infrastructures: file system, code execution, sandbox, context engineering, verification loop, durable memory, orchestration discipline, and recoverable multi-round execution structure.

These things put together make up the harness.

11. Why do many teams still underestimate the harness?

If one supplementary point is worth keeping before the end, I think it is this: **it is not that teams don’t know the harness matters; it is that organizational incentives naturally make it easier to overvalue the model and undervalue the system.**

The reasons are usually not complicated: demos reward “looking smart”; benchmarks tend to amplify model differences; models, platforms, and products are split across separate teams; and buying models makes an easier story, and wins approval more easily, than building a working system.

So the harness often loses out not because nobody mentions it, but because nobody truly owns it and nobody is willing to invest ahead of time for its long-term returns.

Conclusion: when you say you are building an agent, what you are really building is a working system

“The Anatomy of an Agent Harness” is worth reading not because it proposes a new buzzword, but because it clearly articulates a change that has already happened on the engineering ground.

When we talk about agents today, talking only about models and tools is no longer enough. A more accurate statement: when you build an agent, what you really build is not a chat model with extras bolted on, but a working system organized around the model.

If the system is strong, the model will be amplified; if the system is weak, the model will be dragged down.

So if you ask me what the most memorable sentence of this article is, I would rewrite it like this:

Future agent projects will not lose in “model intelligence” itself, but more likely in “whether the intelligence is put into a system that can work stably.”

That’s what harness is about.

It’s not a supporting role. It is the main battlefield.


Who should read this

This article is more suitable for the following types of readers:

  • Engineering teams pushing agents from demo to production
  • AI engineers who already sense that “the model is clearly strong, yet the system is still unstable”
  • People designing workflows around Claude Code, Cursor, OpenHands, or Devin
  • Platform teams designing MCP, skills, tool runtimes, and agent orchestration
  • Anyone who has personally watched an agent drift as a task runs longer
