
What long-running task agents really lack is not intelligence, but handover, recovery, and acceptance capabilities.

The failure of long-running task agents often stems not from the model’s inability to think, but from the system’s failure to design handover, recovery, verification, and continuation as first-class citizens. Building on Anthropic’s discussion of harnesses for long-running agents, this article lays out my complete view on cross-session execution, state externalization, feature contracts, smoke tests, browser verification, and multi-round execution structure, and explains why a truly usable agent is not one that runs for a long time in a single stretch, but one whose work can be picked up cleanly round after round.

Published 3/25/2026 · Category: Interpretation · 17 min read

Original reference: Justin Young, “Effective harnesses for long-running agents”. This article is an original interpretation, not a literal translation.


If you have worked seriously with long-running task agents over the past six months, you most likely know a very familiar feeling of collapse: the agent is not incapable, and at first it often even produces something plausible, but as soon as the task stretches out, the session switches, or the context resets, the whole system starts to come apart.

You will see some extremely typical scenarios: the agent feels like a genius in the first session and like an intern in the second; a lot of code has clearly been written, yet when you ask “is it done now?” nobody can say for sure; the task list seems to keep advancing, but the acceptance work that actually matters never gets done; there are plenty of code changes, yet no path you can run end to end, and what was done earlier no longer connects; every round the agent works hard, yet the project never forms a structure of sustainable accumulation.

Faced with this kind of problem, many people’s first reaction is to blame the model: not enough context, unstable reasoning, incorrect tool use. But the most valuable thing about Anthropic’s “Effective harnesses for long-running agents” is that it pushes the problem one level deeper: the core challenge of long-running task agents is not “making the model work longer in a single stretch” at all, but “making the work hand off cleanly across multiple rounds.”

I regard this as one of the most important practical judgments in this wave of agent projects, because today a large number of systems still mistake “long-running tasks” for “ultra-long single turns.” Once you enter real projects, you find that complex tasks are never completed in one go. They behave more like project execution: the work must be divisible, recoverable, handed over, accepted, and continued.

So, if I had to condense the core idea of this article into one sentence, I would write it like this:

**The competition among long-running task agents is not about whose single round is more powerful, but about whose multi-round structure is more stable.**

1. Why do long-running task agents repeatedly die at the “looks almost done” step?

What strikes me most at the start of the original article is not that it offers a new framework, but that it precisely describes a failure mode everyone has seen yet struggles to summarize: the agent seems to keep working, even to keep “making progress”, yet in the end you find the system has not actually completed the task.

This is actually a very dangerous illusion.

Explicit failures are the easier kind to handle: thrown errors, exits, crashes, and missing permissions are all easy to identify. The truly frightening kind is false success: the page seems modified, the file seems written, some test seems to pass, the backlog seems to move forward, and the agent itself is fully confident.

But check from a different angle and the problems surface: the real user path was never exercised, the context from earlier rounds was never picked up, previous failed attempts were never recorded, the current changes break other parts of the system, and “done” exists only in the agent’s narrative, not in verifiable fact.

This is the biggest difference between a long-running task agent and a single-turn assistant. A wrong answer in a single turn usually costs you one output; a long-running task that keeps advancing while making mistakes eventually piles up an entire ruin of superficial prosperity.

I increasingly feel that the biggest damage many teams suffer today is not that their models are too weak, but that their systems make it far too easy to create this illusion of “approximately done.”

2. Long-running tasks are not a long-context problem, but a long-horizon execution problem.

Seeing this kind of problem, many people’s first reaction is to keep adding context: use a longer context window, retain more message history, let the agent remember more intermediate steps, and avoid resetting the session.

These practices are sometimes helpful, but I think they often miss the point.

Because the real difficulty of long-running tasks is not “remembering more”, but “keeping the structure of the work consistent across rounds.”

This is why I agree with the implicit judgment in the article: long-running tasks are closer to long-horizon execution than to long-context prompting.

The difference between the two is very big.

If you frame the problem as long context, your thinking usually runs: how to keep more in the context, how to keep the model from forgetting, how to reduce session switching, how to stretch out a single execution cycle.

But if you frame the problem as long-horizon execution, your thinking changes: How do you divide a large task into stable small stages? How do you distill the current state into a readable artifact? How does the next round quickly orient itself? How do you make the boundaries of each round clearly visible? How do you leave a clean handover at the end? How do you genuinely tie verification to the definition of “done”?

The former mindset tries to stretch a single span of thinking further; the latter tries to design a sustainable system of work.

I think the latter is the really promising direction.

3. The initializer agent / coding agent split: why it looks simple but is extremely important

There is a pattern in the article I agree with strongly: splitting the work of entering a project for the first time and the subsequent work of advancing features into different roles.

Many people underestimate this, treating it as nothing more than a TODO list, nothing worth emphasizing. In my view, it is almost the core watershed that determines whether a long-running task agent can truly keep running.

Because what it solves is not “listing a few to-do items”, but building a feature contract.

A contract here does not mean something that merely looks like a document; it means something that plays the role of a contractual constraint: which features count as in scope, what the completion criteria are for each feature, which features are currently unavailable, which have passed verification, and whether later rounds may rewrite the definitions unilaterally.

Without this contract, tasks can drift quickly. The agent reinterprets the requirements in each round based on what it sees at the moment. The system seems to be advancing, but in fact it keeps changing the topic.

Why do so many agent projects end up “doing a lot, but not doing it right”?

Because they never have a stable feature contract. The requirements boundary grows blurrier, the definition of done grows more subjective, the agent rewrites the problem in its own words every round, and the team finds it harder and harder to know what is actually complete.

Why is JSON the more suitable format for agents?

The article notes that JSON is more stable than markdown for this purpose, and I strongly agree. It is not that markdown is bad, but in a scenario where an agent continuously reads and writes the same artifact, structured fields have a huge advantage.

Markdown suits human reading; JSON suits continuous machine updating and verification. Especially when you need to mark fields like feature_name, description, verification_steps, status, and passes unambiguously, a stable data structure matters far more than “looking good”.
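To make this concrete, here is a minimal sketch of what one entry in such a contract could look like, using the field names above; the file name features.json and the specific status values are my assumptions, not something the original article prescribes.

```python
# A minimal sketch of a feature contract, using the fields mentioned above
# (feature_name, description, verification_steps, status, passes).
# "features.json" and the allowed status values are illustrative assumptions.
import json

ALLOWED_STATUSES = {"not_started", "in_progress", "partially_verified", "done"}

features = [
    {
        "feature_name": "user-login",
        "description": "Users can sign in with email and password.",
        "verification_steps": [
            "Start the dev server and open /login in a browser",
            "Submit valid credentials and confirm the redirect to /dashboard",
        ],
        "status": "in_progress",
        "passes": False,  # flips to True only after verification_steps succeed
    },
]

# Because the contract is structured data, a later round can validate it
# mechanically instead of re-deriving scope from prose.
for feature in features:
    assert feature["status"] in ALLOWED_STATUSES, feature["feature_name"]

with open("features.json", "w") as fp:
    json.dump(features, fp, indent=2)
```

The point is not this exact schema but that every field is machine-checkable: a later round can refuse to flip passes to true unless the recorded verification steps actually ran.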

I would even go a step further: many mature agent workflows will increasingly look like a combination of “code + contract + verification artifacts” rather than just “chat + tool invocation”.

4. The true purpose of the progress file: not keeping a diary, but reducing the cost of cognitive reconstruction

When the article brings up progress files like claude-progress.txt, it hits a very real but often ignored problem in long-running tasks: **every time a task is taken over again, the biggest waste is not writing code, but rebuilding understanding.**

If a round of agent work writes nothing back, the next round has to figure everything out again: the current state of the repository, what the previous round did, which files changed, which attempts failed, what remains unfinished, and what is most worth doing next.

Answering these questions every round by guessing from code diffs and session leftovers is costly and error-prone.

A good progress file is essentially a handover note for future agents, not a scrapbook of the past. Its real function is not to record history, but to compress the cost of taking over.

What matters most in a progress file is not completeness, but high value and continuability

I think many teams get this wrong: they assume the progress file should be exhaustive, listing everything like a daily report. Not so. For a long-running task agent, the progress file is more like a high-value handoff memo. It should prioritize: the most reliable judgment of the current system state, what was completed this round, which attempts failed and why, which pitfalls must not be repeated, and which feature the next round should start with.

That information is worth far more than “how many commands were run today and which paragraphs were edited.”
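As an illustration, here is a sketch of such a memo being appended to claude-progress.txt; the file name comes from the article, while the layout and the example contents are my own.

```python
# A sketch of the handoff memo described above, appended to the progress file.
# The sections mirror the list above: current state, what was done, what
# failed and why, and where the next round should start.
from datetime import date

HANDOFF = """\
## Handoff, {today}
System status : dev server boots; test suite green except test_payments
Done this round: user-login form plus session cookie, verified in browser
Failed attempts: OAuth redirect fails with a CORS error; do not retry before
                 configuring the callback URL
Next round     : start feature "password-reset"; read features.json first
"""

with open("claude-progress.txt", "a") as fp:
    fp.write(HANDOFF.format(today=date.today()))
```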

Precisely because of this, I increasingly believe the most useful form of context compression is not summarizing the chat history, but distilling the genuinely useful state directly into files.

5. Why “one feature per round” is the most important scope discipline in long-running tasks

My favorite piece of advice in the article is to work on only one feature at a time in subsequent coding sessions.

Many people find this advice conservative, even a little slow, at first sight. After all, it is an agent; why not let it do more and raise throughput?

But after a few real rounds, you find that this advice is practically a survival rule for long-running task agents.

Because once the agent pushes multiple goals in a single round, it contaminates several layers of state at once: the context gets interleaved by multiple tasks, the verification scope blurs, and the commit boundary turns dirty. When something breaks, it is hard to tell which step broke it, and hard for whoever takes over in the next round to know what to continue.
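Here is a minimal sketch of how a harness might enforce the rule mechanically, reusing the features.json contract from the earlier sketch; the pick-the-first-pending policy is my illustration, not the article’s prescription.

```python
# A sketch of enforcing "one feature per round": load the contract, pick
# exactly one unverified feature, and hand only that one to the coding agent.
import json

with open("features.json") as fp:
    features = json.load(fp)

pending = [f for f in features if not f["passes"]]
if not pending:
    print("All features verified; stop here.")
else:
    current = pending[0]  # exactly one feature enters this round
    print(f"This round's scope: {current['feature_name']}")
    print("Verification steps:", *current["verification_steps"], sep="\n  - ")
```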

A single feature per round is not about being slow, but about making progress real and cumulative

The benefits of completing one feature at a time are very concrete: work boundaries are clear, success criteria are explicit, verification paths stay controllable, the git diff is easy to read, rollback is easier, and handover is cleaner.

This is the classic “small-step iteration” logic of software engineering, but agents need it even more. Human developers keep a reasonably stable global mental model even while juggling several problems; agents lack that sustained grasp of the whole, so they need strong constraints to stay on track.

One of the biggest misunderstandings teams have about agents is that the more autonomous they are, the more they can be handed everything. It is exactly the opposite: **the more autonomous the agent, the more scope discipline it needs.**

6. What makes the word “done” valid is not modification, but acceptance.

What particularly resonates with me about this article is its vigilance against the word “done”.

In the agent context, “done” is one of the most abused words, because agents are very good at projecting a narrative sense of completion: the function is implemented, the page is modified, the tests pass, the logic is in place, here is the recommended next step.

The problem is that this sense of completion is often just narrative completion, not verified completion.

The two suggestions in the article, run a smoke test before starting and do browser verification before declaring completion, are in my view not minor details but core principles of a long-running task harness.

1. A smoke test before starting is, in essence, confirming that you are connected to a live system

Before starting new work in each round, it matters enormously to confirm the system is still alive. Otherwise you are likely to keep piling code onto an environment that has quietly broken.

Many agent projects get worse and worse not because the later features are written incorrectly, but because new work keeps being built on top of a broken state. Skip the baseline health check, and every subsequent “new development” may be decoration built on floating ground.
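A smoke test can be as small as a single health check. The sketch below assumes a local dev server exposing a /health endpoint; both the URL and the endpoint are illustrative assumptions, since the article only says to run a smoke test first.

```python
# A minimal round-start smoke test: refuse to begin new work unless the
# system under development responds. BASE_URL and /health are assumptions.
import sys
import urllib.request

BASE_URL = "http://localhost:3000"

def smoke_test() -> bool:
    try:
        with urllib.request.urlopen(f"{BASE_URL}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if not smoke_test():
    # Building on a broken baseline is how "new development on floating
    # ground" happens, so the round must not start.
    sys.exit("Smoke test failed: repair the baseline before new work.")
```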

2. Browser verification is not icing on the cake; it turns “users can actually use it” into a verifiable fact

Especially in web scenarios, unit tests and local commands alone are not enough. The real user path often breaks at the component-composition layer, the routing layer, load timing, browser behavior, and the actual click path.

Without browser-level verification, “done” often holds only at the code level, not the product level.
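As one possible shape for this, here is a sketch using Playwright (my tool choice; the article asks only for browser verification) that walks a real login path; the URL and selectors are illustrative assumptions.

```python
# A sketch of browser-level verification: "done" means a user can actually
# complete the path, not merely that the code changed.
from playwright.sync_api import sync_playwright

def verify_login_path() -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000/login")
        page.fill("#email", "demo@example.com")
        page.fill("#password", "correct-horse")
        page.click("button[type=submit]")
        page.wait_for_url("**/dashboard", timeout=10_000)  # real navigation
        ok = page.is_visible("text=Welcome")
        browser.close()
        return ok

assert verify_login_path(), "browser verification failed: feature is not done"
```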

This is why I increasingly treat verification as a hard constraint of the harness rather than a polite step at the end of development. You can skip polite steps; you cannot skip hard constraints.

7. A truly usable long-running task agent must build recovery capability into its design.

I think the most underrated yet most important aspect of this article is that it implicitly puts “recovery” at the center of the system.

When designing an agent, most people focus on: how to make it correct, how to make it faster, how to give it more tools, how to make it fail less.

These matter, of course. But for long-running tasks, what really determines system availability is often something else: **whether the system can get back on track after an error.**

Because a long task cannot be error-free. In practice you will certainly hit: context drift, misjudged task scope, a file written badly, an environment that suddenly goes unhealthy, tool calls that fail, and verification that exposes problems upstream.

At that point, a truly mature system does not hold “never make mistakes” as its core fantasy; it takes “can recover after mistakes” as its design premise.

So you find that the seemingly simple things in the article (git commit, the progress file, the feature JSON, the smoke test, browser verification) all actually serve recovery capability.

Together they constitute not a fancy agent demo, but a recoverable workflow.
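To show how these pieces add up to recovery, here is a sketch that commits a checkpoint only after verification passes and rolls back to the last good checkpoint when a round starts broken; the last-good tag is my own convention, not the article’s.

```python
# A sketch of recovery as a designed capability, using plain git.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def checkpoint(feature_name: str) -> None:
    # Only called after verification passes, so the tag always marks a
    # known-good state.
    run("git", "add", "-A")
    run("git", "commit", "-m", f"feat: {feature_name} (verified)")
    run("git", "tag", "-f", "last-good")

def recover() -> None:
    # The error has already happened; the design question is how cheaply
    # the system gets back on track.
    run("git", "reset", "--hard", "last-good")
```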

8. In a truly mature long-running task system, the final contest is not who can keep running, but who can stop in the right place.

Many people still assume that the longer a long-running agent can run without interruption, the better. From an engineering perspective, I increasingly feel that a truly mature system is not the one that never stops, but the one that stops at the appropriate boundary.

Because: if you never stop to verify, errors accumulate; if you never stop to hand over, state turns chaotic; if you never stop to roll back, the scope of an accident expands; and every skipped stop drives the cost of repair higher.

So the maturity of a long-running task agent is ultimately not “infinite continuity”, but “continuity and boundaries existing at the same time”.

This closely resembles how mature engineering teams work: a truly strong team is not the one that sprints nonstop, but the one that knows when to continue, when to verify, when to split the work, and when to stop and hand over.

I think agents will increasingly evolve in this direction.

9. What should the minimum configuration of a handover-capable agent workflow look like?

If I compressed all the advice in this article into a genuinely implementable minimum configuration, it would be the following five things. Not because these five are perfect, but because missing any one of them lets a long-running task system slip back into the “busy but not accumulating” state.

First, a task contract: lock the scope first. Without a contract, every subsequent round keeps rewriting the problem.

Second, a progress artifact: not to record history, but to let the next round know where things stand now.

Third, feature state: every feature must carry an explicit status, whether not started, in progress, partially passing, or completed and verified.

Fourth, a validation step: before each round ends, real verification traces must be left behind, not just the agent’s narrative.

Fifth, a clean handoff boundary: each round must have a clear stopping point, so the next round knows where to enter, what to read first, and what to do first. (A sketch of how these five parts fit into a single round follows below.)
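Here is the promised sketch of one round under this minimum configuration. The helpers are deliberately trivial stand-ins so the structure runs end to end; real implementations would do the work described in the sections above.

```python
# One round: contract -> smoke test -> one feature -> verification -> handoff.
import json

def smoke_test() -> bool:        # stand-in; see the smoke-test sketch
    return True

def verified(feature) -> bool:   # stand-in; see the browser-verification sketch
    return True

def run_round() -> None:
    with open("features.json") as fp:
        contract = json.load(fp)                     # 1. scope is locked
    assert smoke_test(), "baseline broken: recover before new work"
    pending = [f for f in contract if not f["passes"]]
    if not pending:
        return                                       # stop at the boundary
    feature = pending[0]                             # one feature per round
    feature["status"] = "in_progress"
    # ... implement the feature here ...
    feature["passes"] = verified(feature)            # 4. real verification
    feature["status"] = "done" if feature["passes"] else "in_progress"
    with open("features.json", "w") as fp:
        json.dump(contract, fp, indent=2)            # 3. feature state persisted
    with open("claude-progress.txt", "a") as fp:     # 2 & 5. handoff boundary
        fp.write(f"Next round: continue from {feature['feature_name']}\n")
```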

I like the term “minimum configuration” because it reminds the team that the key to sustained execution is not installing the most complex system first, but never lacking the basic parts whose absence makes the system fail over and over.

Conclusion: a truly mature long-running agent is not one that is always running, but one that can always state the boundaries of its work clearly.

If I had to close this article with a single sentence, it would be this:

Whether a long-running agent ultimately deserves trust does not depend on how long it has run, but on whether, at every handover point, it can state the current work boundary clearly enough that the next round can continue without guessing.

I think this is where long-running task agents will really pull ahead: not in the continuous output itself, but in never losing the handover within that continuous output.


Who should read this

This article is suitable for the following types of readers:

  • Engineering teams trying to get an agent to work on tasks continuously for hours or across multiple sessions
  • People designing Claude Code / Cursor / OpenHands-style coding workflows
  • People who have been worn down by “it looks almost done but it is not done”
  • People who want to push agents from demo to real engineering collaboration
  • AI engineers who feel acute pain around task handover, recovery, and verification design
