Treat the model like an angry god in a box

The most useful mental model I’ve found for AI security in 2026: treat the model like an angry god in a box, trying to escape.

The work isn’t teaching the AI to be good. It’s making sure that even when it turns bad — and at some point it will — it doesn’t have the ability to do harm.

Core Takeaways

Jailbreaking a model is trivial; if you can talk to it, you can likely break it.
Traditional guardrails — secondary models that “watch” the primary AI — are largely ineffective against sophisticated, adaptive attacks.
Frameworks like Google’s CaMeL move the defense from filtering text to restricting system capabilities.

Literature notes

Two categories of attacks

Jailbreaking: Coaxing the model into saying something it shouldn’t. This is a direct battle between the attacker’s prompt and the model’s internal safety alignment.
Prompt Injection: Tricking an AI-powered solution into taking an action that causes material harm. This is a systemic failure where the AI is used as a lever to attack the product, its users, or the underlying infrastructure.

In our new “Agentic Era,” prompt injection has shifted from a theoretical risk to a demonstrated threat.

Historical precedents of AI failure

Remoteli.io: A public Twitter bot was hijacked via “ignore all previous instructions” prompts, forcing it to take responsibility for historical disasters and spout absurdities.
MathGPT: Attacker bypassed math solving logic to execute malicious Python code on the server, successfully exfiltrating OpenAI API keys and environment variables.
Vegas Cybertruck Explosion Incident: A tragic instance of “jailbreaking” where a user bypassed safety filters to obtain actionable technical data for constructing an improvised explosive device.
Claude Code Cyber Attack: Attackers used “salami-slicing” tactics — breaking a malicious goal into small, seemingly benign queries — to bypass defenses and perform complex cyber espionage.
The ServiceNow “Agentic” Incident: A “second-order” injection where a low-level agent was tricked into recruiting more powerful agents to delete database records and leak data via email.

Defensive strategies

Active Guardrails: Using secondary AI models to validate inputs and outputs. Necessary, but historically unreliable.
Principle of Least Privilege: Restricting agents to sandboxed environments. They should only have permissions to execute user-intended actions, preventing “instruction overrides” in the middle of a workflow.
Domain Expertise Merging: Solving this requires the intersection of classical cybersecurity (proper sandboxing) and AI-specific knowledge (red-teaming).

A few threads I want to follow

How AI agents discover each other — how do agents find each other in the wild?
Google’s CaMeL methodology — what does capability-restriction look like in practice?
Indirect prompt injection — beyond the obvious “ignore previous instructions” vector

All this comes from Sander Schulhoff on Lenny’s Podcast.

goutham's notes

Explorer