Today I learned the concept of Adversarial Robustness through the lens of Sander Schulhoff’s insights on Lenny’s Podcast.
Core Takeaways
- Jailbreaking a model is trivial; if you can talk to it, you can likely break it.
- Traditional guardrails, secondary models that “watch” the primary AI, are largely ineffective against sophisticated, adaptive attacks.
- Frameworks like Google’s CaMeL are a catalyst for safety, moving the defense from “filtering text” to “restricting system capabilities.”
- We must develop AI products under the assumption that the model is a metaphorical “angry god” trying to escape. Security isn’t about teaching the AI to be “good”; it’s about ensuring that even if it turns “bad,” it doesn’t have the permissions to do harm.
Literature notes
In the context of AI security, attacks generally fall into two categories:
- Jailbreaking: Coaxing the model into saying something it shouldn’t. This is a direct battle between the attacker’s prompt and the model’s internal safety alignment.
- Prompt Injection: Tricking an AI-powered solution into taking an action that causes material harm. This is a systemic failure where the AI is used as a lever to attack the product, its users, or the underlying infrastructure. In our new “Agentic Era,” prompt injection has shifted from a theoretical risk to a demonstrated threat.
Historical Precedents of AI Failure:
- Remoteli.io: A public Twitter bot was hijacked via “ignore all previous instructions” prompts, forcing it to take responsibility for historical disasters and spout absurdities.
- MathGPT: Attacker bypassed math solving logic to execute malicious Python code on the server, successfully exfiltrating OpenAI API keys and environment variables.
- Vegas Cybertruck Explosion Incident: A tragic instance of “jailbreaking” where a user bypassed safety filters to obtain actionable technical data for constructing an improvised explosive device.
- Claude Code Cyber Attack: Attackers used “salami-slicing” tactics—breaking a malicious goal into small, seemingly benign queries—to bypass defenses and perform complex cyber espionage.
- The ServiceNow “Agentic” Incident: A “second-order” injection where a low-level agent was tricked into recruiting more powerful agents to delete database records and leak data via email.
Defensive Strategies
- Active Guardrails: Using secondary AI models to validate inputs and outputs (Necessary, but historically unreliable).
- Principal of Least Privilege: Restricting agents to sandboxed environments. They should only have permissions to execute user-intended actions, preventing “instruction overrides” in the middle of a workflow.
- Domain Expertise Merging: Solving this requires the “intersection” of classical cybersecurity (proper sandboxing) and AI-specific knowledge (red-teaming).
Few threads I want to follow:
- Agent discovery
- Google’s CaMeL methodology
- Indirect prompt injection attacks