The most useful mental model I’ve found for AI security in 2026: treat the model like an angry god in a box, trying to escape.
The work isn’t teaching the AI to be good. It’s making sure that even when it turns bad — and at some point it will — it doesn’t have the ability to do harm.
Core Takeaways
- Jailbreaking a model is trivial; if you can talk to it, you can likely break it.
- Traditional guardrails — secondary models that “watch” the primary AI — are largely ineffective against sophisticated, adaptive attacks.
- Frameworks like Google’s CaMeL move the defense from filtering text to restricting system capabilities.
Literature notes
Two categories of attacks
- Jailbreaking: Coaxing the model into saying something it shouldn’t. This is a direct battle between the attacker’s prompt and the model’s internal safety alignment.
- Prompt Injection: Tricking an AI-powered solution into taking an action that causes material harm. This is a systemic failure where the AI is used as a lever to attack the product, its users, or the underlying infrastructure.
In our new “Agentic Era,” prompt injection has shifted from a theoretical risk to a demonstrated threat.
Historical precedents of AI failure
- Remoteli.io: A public Twitter bot was hijacked via “ignore all previous instructions” prompts, forcing it to take responsibility for historical disasters and spout absurdities.
- MathGPT: Attacker bypassed math solving logic to execute malicious Python code on the server, successfully exfiltrating OpenAI API keys and environment variables.
- Vegas Cybertruck Explosion Incident: A tragic instance of “jailbreaking” where a user bypassed safety filters to obtain actionable technical data for constructing an improvised explosive device.
- Claude Code Cyber Attack: Attackers used “salami-slicing” tactics — breaking a malicious goal into small, seemingly benign queries — to bypass defenses and perform complex cyber espionage.
- The ServiceNow “Agentic” Incident: A “second-order” injection where a low-level agent was tricked into recruiting more powerful agents to delete database records and leak data via email.
Defensive strategies
- Active Guardrails: Using secondary AI models to validate inputs and outputs. Necessary, but historically unreliable.
- Principle of Least Privilege: Restricting agents to sandboxed environments. They should only have permissions to execute user-intended actions, preventing “instruction overrides” in the middle of a workflow.
- Domain Expertise Merging: Solving this requires the intersection of classical cybersecurity (proper sandboxing) and AI-specific knowledge (red-teaming).
A few threads I want to follow
- How AI agents discover each other — how do agents find each other in the wild?
- Google’s CaMeL methodology — what does capability-restriction look like in practice?
- Indirect prompt injection — beyond the obvious “ignore previous instructions” vector
All this comes from Sander Schulhoff on Lenny’s Podcast.