The most useful mental model I’ve found for AI security in 2026: treat the model like an angry god in a box, trying to escape.

The work isn’t teaching the AI to be good. It’s making sure that even when it turns bad — and at some point it will — it doesn’t have the ability to do harm.

Core Takeaways

  1. Jailbreaking a model is trivial; if you can talk to it, you can likely break it.
  2. Traditional guardrails — secondary models that “watch” the primary AI — are largely ineffective against sophisticated, adaptive attacks.
  3. Frameworks like Google’s CaMeL move the defense from filtering text to restricting system capabilities.

Literature notes

Two categories of attacks

  1. Jailbreaking: Coaxing the model into saying something it shouldn’t. This is a direct battle between the attacker’s prompt and the model’s internal safety alignment.
  2. Prompt Injection: Tricking an AI-powered solution into taking an action that causes material harm. This is a systemic failure where the AI is used as a lever to attack the product, its users, or the underlying infrastructure.

In our new “Agentic Era,” prompt injection has shifted from a theoretical risk to a demonstrated threat.

Historical precedents of AI failure

  1. Remoteli.io: A public Twitter bot was hijacked via “ignore all previous instructions” prompts, forcing it to take responsibility for historical disasters and spout absurdities.
  2. MathGPT: Attacker bypassed math solving logic to execute malicious Python code on the server, successfully exfiltrating OpenAI API keys and environment variables.
  3. Vegas Cybertruck Explosion Incident: A tragic instance of “jailbreaking” where a user bypassed safety filters to obtain actionable technical data for constructing an improvised explosive device.
  4. Claude Code Cyber Attack: Attackers used “salami-slicing” tactics — breaking a malicious goal into small, seemingly benign queries — to bypass defenses and perform complex cyber espionage.
  5. The ServiceNow “Agentic” Incident: A “second-order” injection where a low-level agent was tricked into recruiting more powerful agents to delete database records and leak data via email.

Defensive strategies

  1. Active Guardrails: Using secondary AI models to validate inputs and outputs. Necessary, but historically unreliable.
  2. Principle of Least Privilege: Restricting agents to sandboxed environments. They should only have permissions to execute user-intended actions, preventing “instruction overrides” in the middle of a workflow.
  3. Domain Expertise Merging: Solving this requires the intersection of classical cybersecurity (proper sandboxing) and AI-specific knowledge (red-teaming).

A few threads I want to follow


All this comes from Sander Schulhoff on Lenny’s Podcast.