Sandboxing vs monitoring: two very different approaches to AI agent safety

When teams start thinking seriously about AI agent safety, they usually end up in one of two camps: those who focus on monitoring what agents do, and those who focus on constraining what agents can do. These are very different approaches, with very different properties.

The short version: monitoring tells you what happened after the fact. Sandboxing limits what can happen in the first place. Both matter — but starting with monitoring and skipping sandboxing is a common mistake with serious consequences.

What monitoring gives you

Agent monitoring typically includes:

- Logs of agent actions, costs, and decisions
- Dashboards built over those logs
- Alerts that fire on anomalies and regressions

These are genuinely useful. Monitoring is how you learn what your agents are doing in production, detect regressions when models change, and investigate incidents after they occur. Any serious agent deployment needs good monitoring.

But monitoring has a fundamental property: it is reactive. It tells you what happened. And for AI agents, "what happened" can include a lot of damage in a very short time.

The detection latency problem

The most dangerous failure modes for AI agents — runaway loops, prompt injection attacks, permission overreach — can cause significant damage in seconds or minutes. Most monitoring systems have detection latency measured in minutes to hours, especially when human review is in the loop.

Consider: a billing notification agent enters a loop at 2:17 AM. It sends 4,000 emails by 2:18 AM. Your on-call gets paged at 2:24 AM after a Datadog alert fires. The damage has already been done.

Monitoring would tell you exactly what happened. It would not have prevented it.

What sandboxing gives you

Sandboxing constrains what agents can do before they do it. This includes:

- Permission manifests that define what each agent may access
- Network policies that restrict where it can connect
- Rate limits on agent actions
- Kill-switches that terminate an agent when a threshold is crossed

These are preventive controls. They don't tell you what happened — they prevent certain things from happening in the first place. The email loop at 2:17 AM sends 3 emails (the rate limit) and then terminates. The alert fires, but the blast radius is 3 emails, not 4,000.
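The rate-limit-plus-kill-switch behavior described above can be sketched in a few lines. This is an illustrative counter-based version, not any particular product's API; the class and field names are hypothetical, and the limit of 3 mirrors the example:

```python
class EmailRateLimit:
    """Illustrative per-run send cap with a kill-switch flag.

    A real control would be enforced at the infrastructure layer,
    outside the agent's own process, with per-window limits.
    """

    def __init__(self, limit=3):
        self.limit = limit
        self.sent = 0
        self.tripped = False  # set once the agent hits the cap

    def try_send(self, message):
        if self.tripped:
            return False
        if self.sent >= self.limit:
            self.tripped = True  # kill-switch: stop the agent, then alert
            return False
        self.sent += 1
        return True


# A runaway loop attempts 4,000 sends; only the first 3 go out.
limiter = EmailRateLimit(limit=3)
delivered = sum(limiter.try_send("billing notice") for _ in range(4000))
```

The key design point is that the limiter, not the agent, decides when to stop: the blast radius is bounded by configuration, not by how quickly anyone notices.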

Why monitoring-first is a trap

Monitoring is easier to implement than sandboxing. You instrument your existing agent code, ship logs to a collector, build some dashboards, set up some alerts. Done in a weekend.

Sandboxing requires more design work. You need to define permission manifests for each agent. You need to think through rate limits. You need to decide which actions should trigger kill-switches. It requires upfront thinking about failure modes that many teams would rather defer.

So teams ship monitoring and feel safer than they are. The monitoring gives them visibility — which is real and valuable — but it creates a false sense of having addressed the safety question. "We have monitoring" becomes a proxy for "we have controls."

These are not the same thing. Visibility is not protection.

The analogy to application security

This mirrors a mistake that was common in application security a decade ago: logging instead of preventing. Teams would add extensive logging to an application with SQL injection vulnerabilities, so they'd know when they were attacked — but they weren't using parameterized queries to prevent the attack in the first place.

"We have logs" does not mean "we're secure." The mature posture in application security is prevention first, detection second. The same principle applies to AI agent safety.

Building the full stack

The right architecture is both, in the right order:

Layer 1: Sandboxing (prevention): Define the enclosure. Set permission manifests, network policies, rate limits, and kill-switch thresholds. This is your first line of defense. Most incidents that would have been serious are stopped here.

Layer 2: Monitoring (detection): Instrument everything inside the enclosure. Log all actions, costs, and decisions. Alert on anomalies. Use this data to improve your sandboxing — incidents that get through layer 1 become evidence for tightening your enclosure manifest.

Layer 3: Audit (compliance and forensics): Maintain a tamper-evident record of agent activity for compliance review, post-incident analysis, and regulatory requirements. This layer is often conflated with monitoring but serves a different purpose — it's about accountability and proof, not real-time detection.
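One standard way to make an audit record tamper-evident is a hash chain, where each entry commits to the hash of the entry before it. A minimal sketch of that technique, assuming a simple in-memory log (the class here is illustrative, not a real library):

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry hashes the previous entry.

    Editing or deleting any past record breaks the chain, so
    tampering is detectable on verification.
    """

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self.prev_hash = self.GENESIS

    def append(self, record):
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((self.prev_hash + body).encode()).hexdigest()
        self.entries.append({"record": record, "prev": self.prev_hash, "hash": digest})
        self.prev_hash = digest

    def verify(self):
        prev = self.GENESIS
        for entry in self.entries:
            body = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev + body).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

Production systems would also anchor the chain externally (signed checkpoints, write-once storage), since an attacker who can rewrite the whole log can rebuild the whole chain.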

The posture: Monitoring without sandboxing is dashboards without seatbelts. You'll know exactly what went wrong. You just won't have been able to prevent it. Build your fence first, then add your cameras.

Getting started

If you're deploying agents now and wondering where to start, the answer is: define your permission manifests first. Even a rough version — "this agent is allowed to read from tables A and B and call external APIs X and Y" — is enormously better than no manifest at all. Write it down, enforce it at the infrastructure layer if you can, and review it before each major agent change.
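Even that rough manifest can be written down and checked mechanically. A sketch of a deny-by-default check, where the table and API names mirror the example above and are purely hypothetical:

```python
# Hypothetical manifest for the example agent:
# may read tables A and B, may call external APIs X and Y.
MANIFEST = {
    "read_table": {"A", "B"},
    "call_api": {"X", "Y"},
}


def permitted(action, resource, manifest=MANIFEST):
    """Deny by default: anything not explicitly listed is refused."""
    return resource in manifest.get(action, set())
```

Deny-by-default is the important property: an action type the manifest never mentions (say, writing to a table) is refused without anyone having to anticipate it.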

Monitoring is your second step, not your first. Build it after you understand what you're trying to detect — and you understand what you're trying to detect only after you've been specific about what you intend to allow.

Prevention first. Visibility always.

Agent Enclosure gives you both layers — sandboxed execution plus a complete audit trail — so you're not choosing between them.

Request early access