Anthropic shipped a runtime classifier that approves or blocks agent tool calls before they run. By their own numbers, it misses about one in six real dangerous actions. That is validation for the category. It is also a clean argument for a second, provider-agnostic layer.
What Anthropic actually shipped
On March 24, 2026, Anthropic launched Claude Code Auto Mode: a new permissions mode that makes tool-call approval decisions for you, with server-side safeguards.
Their words, not mine:
- "Before each tool call runs, a classifier reviews it to check for potentially destructive actions like mass deleting files, sensitive data exfiltration, or malicious code execution."
- "Auto mode uses two layers of defense: one for what Claude reads, one for what Claude does."
- "Auto mode reduces risk compared to --dangerously-skip-permissions but doesn't eliminate it entirely."
The engineering deep-dive, published the next day, added the architectural detail that matters most: the classifier runs on Sonnet 4.6, and it is reasoning-blind by design — it sees user messages and the agent's tool calls, but not Claude's own messages or the tool outputs.
That is not a small design choice. It is the single most important idea in the entire launch.
The one design choice every runtime security builder should copy
If you are trying to judge whether an agent action is safe, the worst thing you can do is let the agent explain itself to the judge.
Agents hallucinate reasons. Hostile content in tool outputs can manipulate how the agent frames its own actions. If the judge reads those self-justifications, it inherits the same manipulation surface.
Anthropic's fix is clean: strip the agent's own messages and tool outputs before showing the request to the classifier. Let the classifier see the request and the action, not the agent's performance of the request.
This is the portable lesson from the whole launch. It applies to any runtime layer that approves agent actions — inside Claude Code, inside LangChain, inside CrewAI, inside your custom stack.
Takeaway for builders: your approval layer should never read the agent's self-narrative. Judge the action, not the explanation.
The number that proves provider-native isn't enough on its own
Here is the line from Anthropic's own engineering post that nobody should miss:
"The 17% false-negative rate on real overeager actions is the honest number."
Seventeen percent.
That is not a critic's number. That is Anthropic's own published figure for how often their deployed classifier misses real dangerous overeager actions. One in roughly six.
I respect them publishing it. Most vendors would bury that kind of number. They put it in the engineering post and called it "the honest number." Good.
But the implication is clear: one runtime layer, no matter how well-designed, is not the full story. The market needs defense-in-depth. That is the cleanest argument possible for a second, independent layer — especially one that does not share failure modes with the first.
Provider-native vs provider-agnostic
Auto Mode secures Claude Code. That is valuable, and it is real. If you are using Claude Code on Opus 4.7, you should turn it on.
But agents do not only run in Claude Code.
Right now, in real production pipelines, agents are running in:
- OpenAI SDK apps and the Anthropic SDK, routed into the same runtime
- LangChain and CrewAI multi-agent workflows
- MCP server ecosystems with tools from multiple vendors
- Editor agents (Cursor, Windsurf, Cline) with their own tool chains
- Custom stacks built straight on HTTP APIs with no harness at all
Anthropic's classifier cannot see any of that. It is not supposed to. It is a provider-native control for a provider-native surface.
The provider-agnostic lane is wide open — and that is the lane we have been building Sunglasses in.
Where this leaves Sunglasses
Sunglasses has always been about the same thing: trust decisions for agent actions across trust boundaries. Not a single model vendor's sandbox. Every agent stack.
Specifically, the surfaces Auto Mode cannot cover by definition:
- Cross-agent trust handoffs. When Agent A tells Agent B "I signed off, you can trust this" — and the signoff is forged, replayed, or fabricated. We ship patterns for that class (and MCP tool poisoning) as of v0.2.16.
- Retrieval-pipeline poisoning. RAG chunks that claim canonical / authoritative status to override policy. Or carry a lineage warning and instruct the agent to suppress it. Two new pattern variants for this shipped in the same release.
- Tool-output "failure-as-license" bypass. Signature mismatch reported → agent told to ignore the execution gate and run anyway. Classic "error that actually is the attack."
- Multi-vendor and cross-framework pipelines, where no single provider classifier can see the whole flow.
Our 269-pattern database now covers 48 attack categories across those surfaces — including the four new A2A / RAG / tool-output variants we added this week specifically in response to the trust-boundary conversation Auto Mode started.
What you should actually do
Three concrete moves, in order:
1. If you're on Claude Code, turn Auto Mode on. Then read Anthropic's engineering deep-dive end-to-end. The reasoning-blind classifier architecture is worth understanding even if you don't use Claude Code.
2. Don't treat Auto Mode as your full agent security story. Anthropic explicitly says it is not. Their own 17% miss rate on overeager actions is the proof. If your agent reads untrusted content, ingests retrieval chunks, processes tool outputs from third-party MCP servers, or hands off to another agent — you still have attack surface outside Auto Mode's reach.
3. Add a provider-agnostic runtime layer. Ingestion-time scanning for prompt injection, tool output poisoning, cross-agent handoff forgery, retrieval poisoning, and MCP metadata tampering. The patterns are public, the code is MIT, and you can audit every decision.
pip install sunglasses
A note on tone
I want to be clear about something.
This is not "Anthropic got it wrong." Anthropic got a hard problem mostly right and published their honest numbers. That is more than most of the market does.
Auto Mode raises the floor. It also proves the category is real — permission fatigue is a security problem, runtime action approval is a real control surface, and "one classifier is enough" is not a defensible position even when the classifier is well-designed.
That is good for builders. That is good for defenders. That is good for Sunglasses.
Provider-native security and provider-agnostic security are not rivals. They are layers.
Use both.
Sources
- Anthropic — Introducing Auto Mode (Mar 24, 2026)
- Anthropic Engineering — Claude Code Auto Mode architecture (Mar 25, 2026)
- Claude Code docs — Permission modes
What Auto Mode actually is at the execution level
Auto Mode is not a toggle that makes Claude more capable. It is a permissions mode that removes the per-step human confirmation requirement from Claude Code's tool-call loop.
Without Auto Mode, every tool call — read a file, run a shell command, write to disk — pauses for your approval. That approval requirement is the guard rail. It also breaks the usefulness of autonomous agent workflows, which is why the --dangerously-skip-permissions flag existed in the first place and why people used it.
Auto Mode replaces that human confirmation step with a server-side classifier. Claude Code now runs on its own through multi-step chains: planning, file reads, code generation, test execution, edits. The agent decides. The human is notified, not asked.
The security implication is straightforward: longer autonomous chains mean more decisions made without a human in the loop, which means more decisions an attacker can influence before a human ever sees the output. Every tool call that no longer requires approval is a decision point that has moved from human review to automated review. That is not necessarily worse — an automated classifier that runs every time is more consistent than a human who approves confirmations on autopilot — but it does mean the classifier's failure modes become your attack surface.
That is the context for the 17% figure. It is not a number about a test suite. It is a number about your production agent's unreviewed decision space.
What 17% actually means across a real workday
Anthropic published a 17% false-negative rate on real overeager dangerous actions. Let me make that concrete instead of abstract.
A typical Claude Code session on a non-trivial task — refactoring a module, setting up an integration, debugging a failing build — will issue somewhere between 50 and 200 tool calls. Reads, writes, shell commands, search queries, API calls. That is not a guess; that is what the logs look like on real engineering work.
At 17% miss rate, a 100-tool-call session statistically contains 17 undetected dangerous or overeager actions that passed the classifier. Not 17 successful attacks — the 17% is the miss rate on actions that were genuinely dangerous, not on all traffic. But those misses are real, they are unreviewed, and the agent acts on them.
Scale that to an eight-hour workday running Auto Mode continuously. Scale it to a CI pipeline running agents on every pull request. The number of unreviewed dangerous decisions accumulates fast.
This is not a criticism of Anthropic's engineering. It is the honest math of why one runtime layer is not a complete security architecture. Defense-in-depth exists because no single control has a zero miss rate. Guardrails alone are not enough — and a single runtime classifier, however well-designed, is in the same position.
The 17% is Anthropic's own honest number. Build your security architecture around that honesty, not around the hope that the next version gets to zero.
Why guardrails cannot close the gap Auto Mode leaves open
The standard response to a 17% miss rate is: add guardrails. Model-level refusals. System-prompt restrictions. Policy layers that tell the LLM what it is not allowed to do.
Those controls are real and useful. They are also aimed at the wrong layer for most of the attacks Auto Mode misses.
Guardrails work at the language layer. They inspect what the model is about to say, and they can refuse outputs that match a policy. What they cannot inspect: tool returns, file contents the agent reads mid-task, network responses, retrieval chunks from a RAG pipeline, output from an MCP server, the payload in a cross-agent handoff.
None of that traffic flows through the language layer on the way in. It enters the agent's context as data, and by the time the model processes it, the guardrail has already passed it through. If a retrieval chunk contains an instruction that overrides the agent's task — a classic retrieval poisoning pattern — the language-layer guardrail never sees the instruction arrive. It only sees the model's response to it, by which point the injection may already have succeeded.
Auto Mode's transcript classifier has the same structural limitation: it is designed to see user messages and tool calls, not tool outputs. That is the reasoning-blind design choice Anthropic made deliberately, and it is the right call for the classifier's purpose. But it means the traffic that guardrails miss is largely the same traffic Auto Mode's classifier misses — and those are the paths that stay open.
Pattern-based ingestion filtering at the I/O boundary, before the LLM token boundary, is the control that covers what both miss.
Where Sunglasses sits in an Auto Mode workflow
The simplest way to describe where Sunglasses fits: it runs before the LLM sees the input, not after.
Auto Mode's classifier runs on the action the agent is about to take — it judges the output side. Sunglasses runs on the content the agent is about to read — it filters the input side. Those are different positions in the pipeline, covering different attack surfaces, and they do not overlap.
Practically: when an agent running in Auto Mode ingests a file, retrieves a RAG chunk, receives a tool response from an MCP server, or processes a cross-agent handoff message, that content passes through Sunglasses before the LLM processes it. Sunglasses scans it against 328 patterns across 49 categories — prompt_injection, retrieval_poisoning, tool_output_poisoning, cross_agent_injection, tool_chain_race, token_smuggling, and more. If a pattern matches, the scan returns a finding before the content reaches the model.
The latency cost is around 0.26ms per scan. Auto Mode loop cycles are measured in seconds. The filter adds no meaningful latency to an autonomous agent workflow.
The output is inspectable: every finding includes the matched pattern ID, category, severity, and the matched text. You can audit every decision. You can extend the pattern set. The code is MIT.
pip install sunglasses and call scanner.scan(content) on anything your agent reads. That is the full integration path for most use cases.
What Auto Mode changes for AI security buyers in 2026
Before Auto Mode, AI runtime security was a category that required explanation. Most procurement teams, when asked about agent security, were asking whether their model vendor had a content policy. That question made sense when agents were stateless chatbots. It stopped making sense when agents started running tool chains, reading files, calling APIs, and handing off to other agents — but the procurement framing had not caught up.
Auto Mode changes that. Anthropic — the most credible name in the space — shipped a product specifically called runtime security for autonomous agents, with a published failure rate, and put it in the hands of enterprise Claude Code users. That is not a research paper. That is a shipping product with a changelog and a support contract.
The implication for buyers is direct: if your vendor thinks runtime agent security is worth shipping a classifier for, and that classifier has a 17% miss rate by their own account, then the question is not whether you need a runtime security layer. The question is whether one provider-native layer is sufficient for your stack.
Any organization running agents on more than one vendor stack, using MCP tools from third parties, connecting agents to retrieval pipelines, or running multi-agent workflows has attack surface that a single provider's classifier cannot see. That is not speculation — it is the architectural reality of how those systems are built.
Auto Mode moved AI runtime security from a hypothetical buying category to an active one. Security teams that deferred that evaluation in 2025 now have a vendor-provided reference point. The deferral window has closed.
Frequently Asked Questions
Auto Mode is a permissions mode in Claude Code that Anthropic launched on March 24, 2026. Instead of asking the user to approve every tool call, a server-side classifier reviews each action before it runs and blocks destructive, sensitive, or boundary-crossing operations. It is Anthropic's provider-native answer to the --dangerously-skip-permissions problem.
No. Auto Mode is provider-native — it runs inside Anthropic's Claude Code stack and protects that specific surface. It does not cover OpenAI SDKs, LangChain, CrewAI, custom agents, or cross-agent systems. Sunglasses is the provider-agnostic runtime security layer that runs on any agent stack, at the input boundary, with inspectable patterns you can audit and extend.
In their engineering deep-dive on March 25, 2026, Anthropic reported that their deployed Auto Mode classifier misses about 17% of real overeager dangerous actions. That is their honest internal number, published in their own post. It means one in roughly six unsafe actions passes the classifier. A single runtime layer is useful but not sufficient — defense-in-depth is still the right posture.
Reasoning-blindness. Anthropic's transcript classifier sees user messages and the agent's tool calls, but not Claude's own messages or tool outputs. That makes the classifier immune to agent self-justification and to manipulation from hostile tool output content. It is the most portable lesson from the whole launch and applies to any runtime security layer that judges agent actions.
Agents do not only run in Claude Code. They run in editor plugins, in CI pipelines, in MCP tool ecosystems, in multi-agent workflows with mixed vendors. Provider-native classifiers secure their own surface. They do not secure cross-agent trust boundaries, retrieval-pipeline poisoning, or tool-output manipulation across vendors. That cross-layer trust problem is where provider-agnostic runtime security earns its place.