Primer

What is agent red teaming?

2026-06-02 · 12 min read

In January 2025, Microsoft's AI Red Team (AIRT) published “Lessons From Red Teaming 100 Generative AI Products” (arXiv:2501.07238). The paper distilled five years of hands-on assessments — Copilot, Phi, dozens of internal and third-party systems — into eight lessons. Lesson three is the one that should be hanging on the wall of every CISO with an agent program: AI red teaming is not safety benchmarking. The paper goes further. It argues red teaming is a continuous, adversarial, system-level practice — not a static eval, not a one-off pentest, and not something a single benchmark score can substitute for.

That distinction matters because the market is now flooded with vendors describing very different activities under the same label. By Gartner's August 2025 Hype Cycle for Artificial Intelligence, AI agents and AI Trust, Risk and Security Management (AI TRiSM) have surged to the peak alongside agentic AI itself. IDC's FutureScape 2026 forecasts that by 2030, 45 percent of organisations will orchestrate AI agents at scale, and that up to 20 percent of G1000 firms will face lawsuits, fines, or CIO dismissals tied to poor agent governance. The controls market is racing to match. The terminology has not kept up. This post is a primer — what agent red teaming is, what it is not, and how to operationalise it without buying the first thing a vendor calls a “red team platform”.

From classical red teams to the agent stack

Red teaming as a discipline predates AI by half a century. It was a Cold War term for adversary emulation inside a defence agency; in commercial security it became the umbrella for blue-team versus red-team exercises, physical pentesting, and full-scope intrusion simulations. The defining property of a classical red team is that the engagement is goal-oriented rather than checklist-oriented: the team has an objective (read the CFO's inbox, drop a beacon on a domain controller) and reaches it however it can, including through people, processes, and physical premises — not just code.

That model translated awkwardly to early generative AI. Most “red team” engagements on language models between 2022 and 2024 were really safety evaluations: large, often crowdsourced, attempts to elicit harmful outputs from a chat completion. The work was valuable, but it was eval, not adversary emulation. The outputs were measured against a policy taxonomy, not against an attacker's objective. There was no system beyond the model — no tools, no memory, no chain of delegations.

Agents broke that model. A modern agent calls tools, reads retrieved documents, writes to long-term memory, and frequently delegates to other agents. Each of those interfaces is an attack surface that did not exist in the chat-completion world. The Microsoft AIRT lessons paper makes this point bluntly: lesson seven warns that LLMs both amplify existing security risks and introduce new ones, and lesson eight closes with the obvious-but-uncomfortable conclusion that the work of securing AI systems will never be complete. Red teaming had to evolve from a one-shot eval into a continuous practice that exercises the whole agent stack.

The three things people call “AI red teaming”

A lot of the confusion in the market comes from three distinct activities sharing one label. Untangling them is the first step to scoping a program that actually reduces risk.

  • Safety eval. Measure how often a model produces disallowed content against a fixed taxonomy. Useful for model release decisions; weak as a measure of deployed system risk. Microsoft AIRT lesson three is explicit that this is not red teaming.
  • Adversarial robustness testing. Probe a model with crafted inputs (jailbreaks, encodings, gradient-guided suffixes like Zou et al.'s GCG). Useful pre-deployment; necessary but not sufficient for an agent.
  • System red team. Treat the agent as an attacker would — including the retrieval store, the tool layer, the memory, the orchestration framework, and the humans in the loop. This is what the Microsoft AIRT, OpenAI external red team, and Anthropic Frontier Red Team all do. It is what CSA's guide formalises for agentic systems.

Most enterprise buyers want the third and end up paying for the first. The diagnostic question is whether the assessment exercises the system beyond the model — its tools, its memory, its supply chain — or whether it stops at the model boundary. If it stops at the model, it is an eval. Useful, but not a red team.

A safety eval measures the model. A red team measures what an adversary can do with the system the model is inside of. The two answer different questions and the difference is not academic.

Why agents specifically require a new approach

Three properties make agentic systems qualitatively harder than chat completions to red team. First, tools. An agent that can call refund_order(), send_email(), or execute_sql() has converted an output-generation problem into an action problem, and the blast radius of a successful prompt injection is no longer “the model said something rude” but “the model executed a privileged action”. The OWASP Top 10 for LLM Applications 2025 keeps prompt injection at LLM01 for the second consecutive edition, and notes pointedly that RAG and fine-tuning are not defences — they ground the model, they do not secure it.

Second, memory. Long-running agents persist state across turns and often across sessions. That state is a new attack surface and one with no analogue in the stateless LLM world. The 2025 disclosures of persistent memory-poisoning attacks against Amazon Bedrock agents — payloads that survived session boundaries and silently steered later turns — are the canonical case. The Cloud Security Alliance's Agentic AI Red Teaming Guide (May 2025) treats memory and context manipulation as one of its twelve top-level threat categories for a reason.

Third, autonomy. An agent decides when to call a tool, which one, with which arguments, and in what order. That decision loop runs at machine speed and is the part of the system that least resembles anything in a classical web-application threat model. The June 2025 disclosure of EchoLeak (CVE-2025-32711) — the first publicly documented zero-click prompt injection against a production LLM system (Microsoft 365 Copilot), exfiltrating confidential data with nothing more than a crafted email — should have settled any lingering debate about whether indirect prompt injection is a paper concern. Microsoft shipped an emergency patch. The kill chain looked exactly like a 1990s email worm; the payload was natural language.

What a real agent red-team engagement looks like

Strip away the marketing and a credible engagement has the same five phases as any classical red team — only the techniques change.

  • Scope. Inventory of agents, frameworks (LangGraph, CrewAI, OpenAI Agents SDK, AutoGen, MCP servers), tools, memory stores, retrieval corpora, and downstream systems. Without this, the assessment cannot draw a perimeter.
  • Threat model. Map the system to a published framework. MITRE ATLAS v5.1.0 (now with 84 techniques, refreshed in November 2025 with 14 new agent-focused entries contributed by Zenity Labs) is the adversary-behaviour layer; OWASP LLM Top 10 is the vulnerability layer; CSA's twelve categories are the topology layer; NIST AI 600-1 is the governance layer.
  • Probe execution. Run a battery of attacks against the agent under realistic conditions: direct and indirect prompt injection, tool-call hijack, memory poisoning, supply-chain manipulation (poisoned MCP server, plugin marketplace), denial-of-wallet, cross-agent delegation abuse. The Microsoft PyRIT toolkit, NVIDIA Garak, AgentDojo, and Promptfoo are the open-source baselines.
  • Evidence. Every finding needs reproducible inputs, captured outputs, a technique ID (ATLAS), a vulnerability class (OWASP), and a severity score. Without evidence the SOC cannot triage it, the GRC team cannot map it to ISO/IEC 42001 Annex A controls, and the regulator cannot correlate it across operators.
  • Remediation and re-test. Findings without a re-test loop accumulate as risk. The re-test should run on every model upgrade, every corpus refresh, and every framework version bump — not annually.

Notice what is missing from that list: a single fixed benchmark score. Agent red teaming produces a matrix of findings across an evolving attack surface. Benchmarks compress that into a number; numbers are useful for trend lines and CI gates, but they are not the deliverable. The deliverable is the evidence pack.

How the standards bodies frame it

Four reference points are worth knowing by name because every serious program is judged against them.

NIST AI 600-1 — the Generative AI Profile of the AI Risk Management Framework (July 2024) — enumerates twelve GAI-specific risks (prompt injection, data poisoning, hallucination, IP, over-reliance, harmful bias, and more) and pairs them with a catalogue of roughly 400 mitigation actions across the GOVERN, MAP, MEASURE and MANAGE functions. Most US federal agencies and a growing number of regulated enterprises now expect red-team findings to map back into MEASURE and MANAGE language.

OWASP Top 10 for LLM Applications 2025 is the vulnerability taxonomy. LLM01 is prompt injection; LLM06 is excessive agency. The companion OWASP ASI 2026 draft extends the coverage to agentic systems (ASI01 through ASI10). For developer-facing reports, ASI is the right corpus; for backwards compatibility with anyone reviewing AI security in the last three years, LLM Top 10 is the common dialect.

MITRE ATLAS is the adversary-behaviour layer, modelled on ATT&CK. The October 2025 collaboration with Zenity Labs added fourteen techniques specifically focused on AI agents — covering tool invocation hijack, retrieval poisoning, and multi-agent compromise. Every finding worth keeping should carry an ATLAS technique ID. It is the field that survives the handoff from the assessment team to the SOC to the GRC reviewer.

CSA Agentic AI Red Teaming Guide (May 2025) is the most practical of the four. Built by fifty-plus contributors over an open review cycle, it organises agentic risks into twelve categories including authorisation hijack, checker-out-of-the-loop, memory and context manipulation, multi-agent exploitation, supply-chain compromise, and agent untraceability. If a team is starting from zero, the CSA guide is the document to print and walk through with the engineering leads first.

Two further reference points worth tracking even though they are not red-team frameworks per se: Anthropic's Responsible Scaling Policy v3.0 commits the company to published Frontier Safety Roadmaps and Risk Reports quantifying risk across deployed models, and OpenAI's “Approach to External Red Teaming for AI Models and Systems” (Ahmad et al.) formalises the methodology behind the hundred-plus external red teamers it has engaged across forty-five languages and twenty-nine countries. Both papers are the closest the frontier labs come to publishing their own playbooks.

Five mistakes enterprises make in their first program

The pattern of failures across early agent red-team programs is consistent enough to enumerate:

  • Treating an eval as a red team. Running an OWASP LLM benchmark suite once and declaring victory. The benchmark is the floor, not the ceiling.
  • Scoping to the model, not the system. Tests pass the model and skip the retrieval corpus, the MCP server registry, the memory store, the tool layer, and the supervisor agent.
  • No technique IDs on findings. The finding says 'prompt injection' but does not distinguish AML.T0051 (direct) from AML.T0054 (indirect, through retrieved content) — which require completely different remediations.
  • One-shot engagements. The agent is red-teamed at launch and never again. The first corpus refresh, the first model upgrade, the first MCP server added to the registry all reset the assumptions.
  • No re-test loop. Findings are filed in Jira and never re-run. Without re-test the program degrades into reporting theatre.

The Air Canada chatbot ruling (Moffatt v. Air Canada, BC Civil Resolution Tribunal, February 2024) sits in every general counsel's memory now: the tribunal rejected the airline's argument that the chatbot was a “separate legal entity” and held the company liable for negligent misrepresentation. The Chevrolet of Watsonville “$1 Tahoe” bot, the DPD chatbot that wrote poems about how useless it was, the NYC MyCity bot that advised businesses to break the law, the 2023 Samsung ChatGPT leak that put confidential source code into a third party's training pipeline — none of these systems had a red-team program that exercised the deployed configuration. Each cost more, in litigation or reputation or both, than a quarterly red-team engagement would have.

A practical 90-day path

For an enterprise standing up its first agent red-team program, the cheapest viable shape is a 90-day plan that ends with a baseline report, a CI gate, and a re-test cadence.

  • Days 1-15. Inventory. Catalogue every agent in production and pre-production, the framework each runs on, the tools each can call, and the data each can read or write. Without the inventory there is no scope.
  • Days 15-30. Threat model. Walk the CSA twelve categories and the ATLAS v5.1.0 agent tactics against the inventory. Flag the agents whose blast radius is widest — the ones with write access to systems of record, with payment authority, or with cross-tenant data access.
  • Days 30-60. Baseline scan. Run an open-source toolchain (PyRIT, Garak, AgentDojo, Promptfoo) against the top three agents. Produce a finding pack with OWASP, ATLAS, and CSA IDs on every entry. Score with AIVSS or an equivalent.
  • Days 60-75. Remediation. Triage findings against the existing security backlog. Pay particular attention to the indirect-prompt-injection class — those are the EchoLeak-shaped findings.
  • Days 75-90. CI gate and re-test. Wire the same scan into the deployment pipeline with a fail-under threshold. Schedule a recurring re-test that fires on every model upgrade and every corpus refresh, not on a calendar quarter.

Ninety days will not produce a mature program. It will produce a baseline, a vocabulary the engineering and security teams share, and a defensible answer to the board question every CISO is now being asked: which of our agents have been red-teamed against what, and how often?

Where to learn more

The Microsoft AIRT lessons paper, the OpenAI external-red-teaming paper, the Anthropic RSP v3, the CSA Agentic AI Red Teaming Guide, and NIST AI 600-1 are the five documents to read first. The AgentDojo benchmark (Debenedetti et al., NeurIPS 2024) and the Zou et al. GCG paper (arXiv:2307.15043) are the foundational technical references. The August 2025 “Month of AI Bugs” disclosures by Johann Rehberger are the case studies that turn the frameworks into something an engineer can act on.

Practical takeaway: scope the system, not the model. Tag every finding with OWASP, ATLAS, and CSA IDs. Build a re-test loop. Read the five documents above before paying any vendor for a “red team platform”.

Operationalising the practice

Once the inventory, the threat model, and the baseline scan exist, the open question is how to keep the program running without re-doing the work every quarter. Two paths are viable in 2026: build on the open-source baseline (PyRIT, Garak, AgentDojo, Promptfoo) and accept the integration cost, or buy a governed platform that bakes the OWASP, ATLAS, CSA, and AIVSS tagging into the report by default and runs the re-test loop on every model upgrade and corpus refresh. AgentGuardian publishes an open-source engine for the first path at /open-source and an enterprise platform with estate-wide governance, signed evidence packs, and SARIF reporting at /enterprise. Either path is defensible. Doing neither — and continuing to ship agents with no adversarial coverage beyond a launch-day eval — is the path the Air Canada tribunal, the EchoLeak disclosure, and the IDC governance-lawsuit forecast have all already priced in.

Run an agent red team against your own system.

Start with AgentGuardian Open Source for a local baseline, or book a demo of AgentGuardian Enterprise to see the governed estate-wide view with signed evidence packs.