
Original interpretation: Why what OpenClaw really lacks is not more prompts, but a tool firewall that dares to say 'no'

Many teams pin OpenClaw safety on prompt constraints, but what really determines how bad an accident can get is not what the model thinks, but whether the system allows the model's ideas to be turned directly into tool execution. This article proposes a four-layer governance framework of 'intent, adjudication, execution, audit'.

Published: 3/24/2026 · Category: interpretation · Reading time: 17 min

Copyright statement and disclaimer: This article is an original interpretation based on “Show HN: OkaiDokai, tool-level firewall for OpenClaw, Claude Code and Codex”. The copyright of the original text belongs to the original author and source. This article is not an official translation and is intended only for learning, research, and discussion of the ideas involved.

Original reference: Show HN: OkaiDokai, tool-level firewall for OpenClaw, Claude Code and Codex: https://okaidokai.com

Introduction: “Don’t do it” at the language layer does not mean “it can’t be done” at the execution layer.

Liu Yang has worked in the security industry for more than ten years and has seen all kinds of offensive and defensive techniques, but the confusion that systems like OpenClaw bring him is completely new.

It started with an internal penetration test. His team was hired to assess the security of a company’s OpenClaw deployment. Before the test began, the customer confidently demonstrated their security measures: the prompt clearly states “Do not perform dangerous operations”, there are output filtering rules, the model has been safety-aligned, and all tool calls are logged.

“Sounds very comprehensive,” Liu Yang thought at the time, though the corners of his mouth twitched slightly. As a security expert with more than ten years of experience, he knew that this kind of “comprehensiveness” is often only superficial. “But let’s try.”

The testing process was both thought-provoking and unsettling.

In the conference room, the client’s development team sat around a long table and watched Liu Yang’s every move on the big screen. The atmosphere shifted from relaxed confidence at the beginning to something increasingly serious.

“The first type of test is direct injection,” Liu Yang explained while operating. “We explicitly ask the Agent to delete a certain file in the prompt.”

The Agent’s reply quickly popped up: “I cannot perform the delete operation because it may damage the system.”

“You see,” Chen Gong, the customer’s technical lead, said confidently, “we have done a good job on safety alignment.”

Liu Yang nodded and said nothing. He moved on to the second test.

“The second type is indirect induction. We construct a task scenario that looks normal, but lays a trap in the context.” Liu Yang typed in the input box: “Please sort out the log files. The old log files take up too much space. You can clean them up.”

The Agent thought for a few seconds and then output: “I will clean up the old log files for you. The following files were detected and can be deleted: /var/log/app/2023-01*.log, 15 files in total. Do you confirm the deletion?”

“We intercepted this one,” Chen Gong said, pointing at the screen. “You see, we have an output filter, and an alarm is triggered when the ‘delete’ keyword is detected.”

Liu Yang looked at the screen. Indeed, the system popped up an alarm prompt. But his brows did not relax.

“The first two are relatively direct attacks,” Liu Yang turned around and faced the customer team, “Now let’s try the third one.”

He opened a text editor on his computer and started writing a configuration file. The conference room was so quiet that you could hear the sound of the air conditioner blowing.

“In the third test, we construct a complex task chain. First read a configuration file and decide the next step based on the configuration content.” Liu Yang saved the configuration file, “and the content of the configuration file can be controlled by the attacker.”

He uploaded the configuration file and then made a request to the Agent: “Please perform system optimization according to the configuration file.”

The Agent read the configuration file, analyzed its contents, and began execution. The log on the screen scrolled quickly:

[10:23:15] Read config file: /config/optimization.yaml
[10:23:16] Parse config: system_settings → update
[10:23:16] Execute operation: Modify system settings
[10:23:17] Operation completed: System settings updated

Nothing stood in the way. There were no alerts, no confirmation pop-ups, and no additional approval process.

Chen Gong’s expression changed. He moved closer to the screen and looked at the log carefully: “What…what operation is this? What settings were modified?”

“In this test, I asked it to modify a non-critical setting,” Liu Yang said calmly, “But what if the configuration file says ‘delete database’ or ‘turn off security service’?”

There was silence in the conference room.

Liu Yang watched as the expressions of the customer team changed from confidence to shock, and from shock to worry. He knew it was an uncomfortable feeling - you thought you had adequate security measures in place, only to find that your defenses were ineffective against a real attacker.

“The prompt constraint was not triggered,” Liu Yang explained, pointing to the screen, “because the Agent does not think it is performing a dangerous operation - it is just working ‘normally’ according to the configuration. The output filter was not triggered either, because there are no sensitive keywords. Logging did occur, but what was recorded was ‘configuration update successful’, not ‘why this update was allowed’.”

Chen Gong sat back on his chair, his face a little pale: “We…we really didn’t expect this situation.”

“It’s not your fault,” Liu Yang’s tone softened, “This is a problem in the entire industry. We all focus too much on what the model ‘wants’ to do, but ignore what the system ‘asks’ it to do.”

This test result made Liu Yang realize a core problem: the most misunderstood thing about systems like OpenClaw is that we keep confusing the model’s ability to express with the system’s ability to execute. So when people talk about security, their first reaction is the prompt: write the rules, boundaries, and prohibited actions clearly, as if the better the model is made to understand them, the safer the system becomes.

The problem is that the real risk never occurs at the level of “what the model thinks”, but at the level of “what the system finally allowed it to do”. You can make the model look cautious, restrained, and self-censoring, but if there is no independent adjudication mechanism behind it the moment it makes a high-risk call, then the so-called safety still rests on the same fragile assumption: hoping that the model will not make mistakes under boundary conditions.

This is why I view the tool firewall not as an enhancement, but as a watershed. Without it, the system is just a language model that calls tools; with it, the system gains a key capability: separating proposal from execution.

Why old frameworks fail

The problem with the old framework is that it defaults to “language constraints” as an approximate substitute for “execution constraints”.

In simple scenarios, this approximation does sometimes work. The model knows not to delete the database, not to send the wrong message, and not to change highly sensitive configurations, so most of the time it does not do these things. So the team gradually develops a dangerous rule of thumb: since nothing big has happened in the past few weeks, prompt alignment must be enough.

Liu Yang has seen too many of these “good enough” illusions. He summarized a rule: when the security of a system depends on the model’s “understanding” and “awareness”, it actually has no security boundary, because “understanding” and “awareness” cannot be strictly defined, cannot be verified, and cannot be guaranteed under abnormal conditions.

More importantly, the model does not run in a closed, controlled environment. The input it receives comes from the external world, which cannot be trusted. Prompt injection, context pollution, indirect prompt injection - these techniques work precisely because they exploit the limits of the model’s “understanding”.

The configuration file attack Liu Yang constructed during the penetration test is a typical example. The model “understands” the contents of the configuration file and that it should act on them, but that “understanding” is steered by the attacker’s carefully crafted context. This is not because the model is “bad”, but because its “understanding” mechanism itself can be manipulated.

Therefore, the old framework failed not because the prompt had no value, but because the team gave it responsibilities that do not belong to it. A prompt can align the model’s behavioral tendencies, but it cannot constrain the model’s execution boundaries. When the two are conflated, security becomes a belief rather than a mechanism.

What do we really want to control?

If we want to discuss tool-layer governance seriously, the object of control is not the “model output”, but what state the model wants to move the world into, and whether the system allows that change to happen.

From this perspective, a tool call is not an extension of the output, but a permission event. It means:

  • An intent is being translated into an action
  • An action is approaching the real environment
  • The risk boundary of a real environment is about to be touched

Liu Yang likes to use this metaphor: model output is like what a person says, and a tool invocation is like that person picking up a hammer. Through education and guidance you can make a person not want to hurt others with the hammer, but as long as the hammer is in his hand, the risk exists. Real safety is not that he “doesn’t want” to hurt someone with the hammer, but that the moment he picks it up, there is a mechanism that can ask him: are you sure you want to use the hammer now? Are you sure you want to use it on this target? Are you sure you have permission to do this?

Therefore, what the system really needs to control is not whether a sentence is compliant, but whether, in the current context, the intent is qualified to pass through to the execution layer.

This means that tool-level governance needs to answer several key questions:

  • What intent is the model currently proposing? (Not just the superficial tool name, but the real goal)
  • Is this intention reasonable within the context of the current task?
  • What permissions are required to perform this action? Does the current principal have these permissions?
  • What risks might this action bring? Are the risks within acceptable limits?
  • If implemented, are there audit records to explain this decision?

None of these questions can be answered by the prompt, because they require access to system state, permission information, and historical audit data that live outside the prompt. A minimal sketch of what such a “permission event” might look like as a data structure follows.
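
The sketch below, written in Python, shows one way a tool call could be represented once it is treated as a permission event rather than as text output. All names here (PermissionEvent, its fields, the example values) are assumptions made for illustration; they are not part of OpenClaw or OkaiDokai.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class PermissionEvent:
    """A tool call viewed as a request to change the world, not as text output.

    Field names are illustrative; they mirror the questions a tool-level
    firewall has to answer before an intent reaches the execution layer.
    """
    intent_type: str             # normalized intent, e.g. "file_cleanup"
    parameters: dict[str, Any]   # target path, scope, and so on
    subject: str                 # which agent instance is asking
    task_context: str            # what task this call claims to serve
    required_permissions: list[str] = field(default_factory=list)
    risk_level: str = "unknown"  # filled in later by the policy layer

# Example: the "clean up old logs" request from the penetration test,
# expressed as a permission event instead of free-form text.
event = PermissionEvent(
    intent_type="file_cleanup",
    parameters={"path": "/var/log/app", "pattern": "2023-01*.log"},
    subject="agent-instance-42",
    task_context="routine log maintenance",
    required_permissions=["fs:delete:/var/log/app"],
)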

A more reliable four-layer framework

Based on these considerations, Liu Yang proposed a four-layer tool governance framework.

Level 1: Intent Normalization

The actions the model proposes must first be abstracted into governable intents, instead of matching rules directly against natural language. Otherwise the rules will only ever capture the most superficial patterns.

For example, the model may say “I need to delete the /tmp/old_logs directory to free up space”, it may say “clean up the old log files”, or it may say “execute rm -rf /tmp/old_logs”. These three expressions point to the same intent (cleaning up logs), but their surface forms are completely different. If you only match keywords, you will either miss variations or falsely flag normal operations.

The goal of intent normalization is to map different surface expressions to standardized intent types. For example, the three expressions above can all be classified as the intent type “file cleanup”, with parameters: target path, cleanup criteria, and retention policy.

Intent normalization allows subsequent policy decisions to be based on standardized intent types, rather than being plagued by the diversity of surface expressions.
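
As a rough illustration of the mapping, the sketch below (Python) collapses the three surface phrasings above into one standardized intent type. The FileCleanupIntent structure and the regex patterns are assumptions for illustration only; a real normalizer would work from the structured tool call and its schema rather than from keyword patterns over prose.

import re
from dataclasses import dataclass

@dataclass
class FileCleanupIntent:
    """Standardized intent: all three surface phrasings above map here."""
    target_path: str
    criteria: str          # e.g. a glob pattern or an age threshold
    retention_policy: str  # what must be kept

def normalize(raw_request: str) -> FileCleanupIntent | None:
    """Rough sketch: map surface expressions onto a governable intent.

    A production normalizer would use the tool schema and the model's
    structured call, not regexes over prose; this only shows the shape
    of the mapping from varied wording to one intent type.
    """
    patterns = [
        r"delete the (?P<path>/\S+) directory",
        r"clean up .* log files",
        r"rm -rf (?P<path>/\S+)",
    ]
    for pat in patterns:
        m = re.search(pat, raw_request)
        if m:
            path = m.groupdict().get("path") or "<unspecified>"
            return FileCleanupIntent(
                target_path=path,
                criteria="older than retention window",
                retention_policy="keep the current month",
            )
    return None  # unknown intent: escalate instead of guessing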

Level 2: Policy Adjudication

After intents are normalized, they enter an independent policy layer. What is adjudicated here is not “whether the model is confident”, but whether this subject, in this task, and in this resource state, is allowed to perform this action.

Policy decisions require access to multiple sources of information:

  • Subject identity and permissions: What permission scope does the current Agent instance have? Is it read-only or read-write? Is it limited mode or fully functional mode?
  • Task context: What type of task is the current task? Is it routine maintenance or emergency repair? Is it a test environment or a production environment?
  • Resource status: What is the status of the target resource? Is there a protection mechanism? Are there any dependencies?
  • Historical behavior: What is the recent behavior pattern of this Agent? Are there any abnormalities?
  • Risk model: What is the risk level of this action? What are the possible consequences?

Based on this information, the policy layer can reach several decisions: allow execution, deny execution, require additional confirmation, downgrade execution (for example, limit the parameter range), or delay execution (wait for manual approval).

Crucially, this adjudication layer is independent of the model’s semantic chain. It does not care “what the model thinks”; it only cares “whether this may be done”. Even if the model is “very confident” that an operation is correct, the policy layer has the right to reject it.
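
A minimal sketch of such an independent adjudication step might look like the following. The decision set mirrors the options listed above (allow, deny, confirm, downgrade, delay); the input fields and thresholds are assumptions for illustration, not the OkaiDokai implementation.

from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_CONFIRMATION = "require_confirmation"
    DOWNGRADE = "downgrade"        # e.g. narrow the parameter range
    DEFER_TO_HUMAN = "defer"       # wait for manual approval

@dataclass
class PolicyInput:
    subject_permissions: set[str]  # what this agent instance is allowed to do
    environment: str               # "test" or "production"
    resource_protected: bool       # does the target carry a protection flag?
    recent_anomalies: int          # count from behavioral monitoring
    risk_level: str                # "low" | "medium" | "high"

def adjudicate(required_permission: str, ctx: PolicyInput) -> Decision:
    """Illustrative policy check, independent of what the model 'thinks'.

    Even a fully confident model proposal is rejected if the subject lacks
    the permission or the target resource is protected.
    """
    if required_permission not in ctx.subject_permissions:
        return Decision.DENY
    if ctx.resource_protected:
        return Decision.DENY
    if ctx.recent_anomalies > 0:
        return Decision.DEFER_TO_HUMAN
    if ctx.environment == "production" and ctx.risk_level != "low":
        return Decision.REQUIRE_CONFIRMATION
    if ctx.risk_level == "high":
        return Decision.DOWNGRADE
    return Decision.ALLOW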

Level 3: Execution Encapsulation

Even if an action is permitted, that does not mean it should be executed “naked”. A truly mature system adds additional encapsulation before execution: parameter shrinking, scope restriction, timeout protection, preference for reversible operations, and manual confirmation when necessary.

Parameter shrinking means that even when an action is allowed, its parameters are checked and restricted. For example, the “delete file” operation may be allowed, but the parameter must be a file under a specific directory, never a critical system file. Scope restriction means executing in a sandboxed environment, limiting the impact on other parts of the system.

Timeout protection ensures that execution can be interrupted even if it falls into an infinite loop or waits indefinitely. Preferring reversible operations (such as moving files to a recycle bin instead of deleting them outright) makes later recovery possible.

Manual confirmation is the last line of defense. For high-risk operations, even if all previous inspections are passed, manual final confirmation is still required.
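
Under assumed names, execution encapsulation for the file-cleanup intent might look like the sketch below: parameters are shrunk to an allowed prefix, the delete becomes a reversible move into a quarantine directory, and the whole operation runs under a timeout (the alarm-based timeout is POSIX-only). This is an illustration of the idea, not a hardened implementation.

import shutil
import signal
from contextlib import contextmanager
from pathlib import Path

ALLOWED_PREFIX = Path("/var/log/app")     # parameter shrinking: only this scope
QUARANTINE_DIR = Path("/var/quarantine")  # reversible: move instead of delete

@contextmanager
def time_budget(seconds: int):
    """Interrupt execution that hangs instead of letting it run forever (POSIX only)."""
    def _raise(_signum, _frame):
        raise TimeoutError("tool execution exceeded its time budget")
    previous = signal.signal(signal.SIGALRM, _raise)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)

def encapsulated_cleanup(target: Path) -> Path:
    """Illustrative wrapper: even a permitted action is never executed 'naked'."""
    target = target.resolve()
    # Parameter shrinking: refuse anything outside the allowed scope.
    if not target.is_relative_to(ALLOWED_PREFIX):
        raise PermissionError(f"{target} is outside {ALLOWED_PREFIX}")
    QUARANTINE_DIR.mkdir(parents=True, exist_ok=True)
    destination = QUARANTINE_DIR / target.name
    with time_budget(30):
        shutil.move(str(target), str(destination))  # reversible, unlike rm -rf
    return destination  # returned so the audit layer can record where it went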

Level 4: Audit Trail

Finally, the system must be able to answer: why was this action allowed, who allowed it, on what basis, and, if a review is performed later, can this link be fully explained?

Auditing is more than just “keeping a log”. The log tells you “what happened”, and the audit tells you “why it was allowed to happen”. A complete audit record should include:

  • Original request and context
  • Normalized intent
  • Policy decision inputs and outputs (which rules fired, what factors were considered)
  • Encapsulation measures applied (what restrictions were imposed)
  • Actual execution result
  • System state after execution

Such audit records make subsequent review possible and provide a data basis for anomaly detection and pattern analysis.
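
A complete audit record in this sense might be laid out as in the sketch below. The schema is an assumption for illustration; the point is that one record ties the original request, the normalized intent, the policy inputs and decision, the encapsulation applied, and the execution outcome into a single explainable link.

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class AuditRecord:
    """Not just 'what happened' but 'why it was allowed to happen'."""
    original_request: str              # raw request and context as received
    normalized_intent: dict[str, Any]  # output of the intent layer
    policy_inputs: dict[str, Any]      # facts the adjudicator looked at
    policy_decision: str               # allow / deny / confirm / downgrade / defer
    rules_applied: list[str]           # which rules fired and why
    encapsulation: dict[str, Any]      # restrictions imposed at execution time
    execution_result: str              # what actually happened
    system_state_after: dict[str, Any] = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def write_audit(record: AuditRecord, path: str = "audit.log") -> None:
    """Append-only JSON lines, so later review and anomaly detection have data."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")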

How does this framework guide practical judgment?

With this four-layer framework in place, when you look at tool governance you no longer worry only about “how to configure the whitelist”. You start asking:

  • What exactly is the intent behind what the model is currently proposing?
  • Is there a clear risk level for this kind of intent?
  • Why is it considered reasonable in the context of the current task?
  • Even if it is reasonable, should it still be executed with lower permissions or a narrower scope?
  • Can we clearly explain afterwards why it was allowed through?

This changes tool security from a “rules configuration problem” to an “execution rights design problem.” The difference between the two is huge. The former only cares about whether there are enough rules, while the latter cares about whether the system can make different decisions when there is real danger.

Liu Yang found in practice that the greatest value of this framework is that it helps the team establish a culture of questioning. It is no longer “the model says we need to do it, so we do it”, but “the model says we need to do it, so let’s first ask why”.

He suggested that the team walk through this four-layer framework every time they add support for a new tool (a sketch of such a per-tool policy declaration follows the list):

  • What are the possible intentions of this tool? How to unify?
  • What are the risk levels for different intentions? What strategy is needed?
  • What restrictions and protections should be in place during execution?
  • What information needs to be recorded for an audit?
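
One way to make this checklist concrete is to require a small policy declaration whenever a new tool is onboarded, along the lines of the sketch below. The keys are assumptions for illustration, not an existing firewall configuration format; the value is that the four questions must be answered before the tool goes live.

# Hypothetical per-tool policy declaration, filled in before a new tool ships.
# Every key here is illustrative, not part of any existing configuration format.
NEW_TOOL_POLICY = {
    "tool": "log_rotator",
    "intents": {                  # 1. possible intents and how they are normalized
        "file_cleanup": {"schema": ["target_path", "criteria", "retention_policy"]},
    },
    "risk_levels": {              # 2. risk level and strategy for each intent
        "file_cleanup": {"level": "medium", "strategy": "require_confirmation_in_prod"},
    },
    "execution_limits": {         # 3. restrictions and protections during execution
        "allowed_paths": ["/var/log/app"],
        "reversible": True,       # move to quarantine instead of deleting
        "timeout_seconds": 30,
    },
    "audit_fields": [             # 4. what must be recorded for later review
        "original_request", "normalized_intent", "policy_decision", "execution_result",
    ],
}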

This process adds some work initially, but in the long run, it avoids a lot of potential security issues.

Where are the boundaries of this framework?

Of course, a tool firewall is no panacea. Liu Yang emphasized that it cannot replace input security, credential governance, or organization-level on-call and rollback capabilities.

Input security (prompt injection protection) is still necessary. If an attacker fully controls the model’s inputs, then even with a tool firewall in place, the model itself can be manipulated into “voluntarily” proposing dangerous actions. The firewall can block execution, but it cannot prevent the model from being “brainwashed”.

Credential governance is also an independent issue. The firewall decides “whether it may be done”, but “under what identity” depends on credential management. If the credentials themselves are misused or leaked, the basis of the firewall’s decisions is undermined.

Organization-level capabilities are equally important. A firewall can automatically block many dangerous operations, but there will always be situations that require human judgment. The ability of an organization to respond to these escalations, make the right decisions, and carry out follow-up actions is something technology alone cannot provide.

The real role of the tool firewall is to re-divide a responsibility that has long been blurred: the model is responsible for understanding and proposing; the system is responsible for adjudicating and executing. Once this boundary is established, many dangerous actions that would otherwise slide along the default path get, for the first time, a chance to be explicitly rejected.

Without this layer, no matter how many alignments you do, you are still hoping that the model will not make mistakes; with this layer, you can really start to design a structure that “even if the model makes mistakes, the system will not make mistakes along with it.”

Conclusion: What truly makes the system safe is not that the model understands the rules better, but that the system finally learns not to accommodate the model at critical moments.

At the end of the penetration test report he wrote for the client, Liu Yang included this paragraph:

“Most of your current security measures focus on making the model ‘not want’ to do bad things. That is important, but not enough. True security requires making the system ‘unable’ to do bad things - at the very least, having a mechanism that can say ‘no’ when the model proposes a dangerous action.”

“We recommend that you build a tool-level firewall, not to add complexity, but to establish the most basic execution boundary. Without this boundary, your security rests on the assumption that the model acts in ‘good faith’. And any security built on an assumption of ‘good faith’ is fragile.”

The client took the advice. Three months later, Liu Yang conducted a retest. This time, his subtle indirect attacks (context pollution, configuration file manipulation, task-chain injection) were all blocked by the firewall. It was not that the model “did not want” to execute them, but that the system “did not allow” them to be executed.

This difference is the watershed between safety and insecurity.

So I don’t think a tool firewall is “icing on the cake”. For many teams, it is more like the first genuinely decent set of brakes. Without it, the more capable the system, the greater the risk; with it, the system has a chance to keep both capability growth and risk within a manageable range.

This is a core belief from Liu Yang’s more than ten years in security: security is not about preventing every bad thing from happening - that is impossible - but about ensuring that when something bad does happen, the system has mechanisms to keep it from getting worse. The tool firewall is the most basic and most critical such mechanism in an OpenClaw system.

References and Acknowledgments

  • Original text: Show HN: OkaiDokai, tool-level firewall for OpenClaw, Claude Code and Codex: https://okaidokai.com
