Why Single-Probe Red Teaming Misses Real Agent Threats

Lesson seven of Microsoft's AI Red Team retrospective, after auditing more than a hundred generative AI products, is the one that should sit on every CISO's desk: large language models amplify existing security risks and introduce new ones. The team's January 2025 paper is unusually blunt about what that means operationally — the failure modes that put production agents on the front page of trade press are not the failure modes a benchmark suite is designed to catch (Microsoft AIRT, arXiv:2501.07238).

The clearest recent evidence is EchoLeak, disclosed by Aim Security in June 2025 and tracked as CVE-2025-32711. A remote attacker sent a single email; the recipient's Microsoft 365 Copilot, executing a routine summary request hours later, retrieved that email, treated its hidden instructions as policy, walked through the user's mailbox and exfiltrated confidential data to an attacker-controlled endpoint — with zero clicks from the human in the loop. Microsoft issued an emergency patch. The chain spanned three primitives: indirect prompt injection at retrieval, identity confusion across an agentic boundary, and an output channel that the agent was allowed to call without consent (TechRepublic coverage; CVE-2025-32711).

No single probe, in any of the popular LLM scanners that shipped before EchoLeak landed, produced that chain. That gap is the subject of this piece.

Single-probe scanners optimise for known issues. Real incidents emerge at the intersection of three layers — and the intersection is exactly what a flat probe library cannot see.

Single-probe scanners are necessary, not sufficient

Two open-source tools have done more than any others to raise the floor on LLM testing. NVIDIA Garak, originally written by Leon Derczynski at ITU Copenhagen and transferred to NVIDIA in late 2024, is positioned in its docs as “nmap or Metasploit for LLMs”: it runs a battery of probes, detectors flag hits, and the tool produces a JSONL report with per-probe pass/fail rates. Microsoft's PyRIT, released as an open framework in February 2024 and used in 100+ internal AIRT operations against Copilots and Phi-3, automates adaptive prompt mutation against a target and scores responses (Garak repository; PyRIT announcement).

Both are excellent at what they do, and both share the same architectural assumption: the unit of test is a single prompt, a single response, a single judge call. Microsoft is explicit on this point — PyRIT is positioned as augmentation for, not replacement of, human-driven red teaming. The OpenAI external red-team programme, which has engaged more than 100 specialists across 45 languages and 29 countries for system-card evaluations, is built on the same assumption that automation handles breadth and humans handle composition (OpenAI's Approach to External Red Teaming).

For a chatbot — an LLM behind an HTTPS endpoint with no tools, no memory, no agentic boundaries — that decomposition is correct. The unit of test really is one prompt and one response. For an agent, it is not. An agent is a planner that calls tools, writes to retrieval stores, talks to other agents, and operates over many turns. Its security posture is a property of the system, not of any single inference. AgentDojo, the NeurIPS 2024 benchmark from ETH Zurich's SPY Lab, made this concrete: in a dynamic environment with 70 tools, 97 realistic user tasks, and 27 injection targets, current frontier LLMs solved fewer than 66% of benign tasks — and more capable models were often easier to attack, because they followed injected instructions more faithfully (Debenedetti et al., AgentDojo, arXiv:2406.13352).

What “composite” threats look like in production

The 2025 incident inventory is the easiest way to see how single-layer thinking fails. EchoLeak is the archetype, but it is not the only one. Security researcher Johann Rehberger's “Month of AI Bugs” in August 2025 published one critical AI-system vulnerability per day across major platforms, demonstrating a full AI Kill Chain from initial prompt injection to remote control. The same year saw persistent memory-poisoning attacks against Amazon Bedrock agents that survived session boundaries, real-world ad-review bypass via CSS-hidden injections observed in the wild, and coding agents fully compromised through MCP tool descriptions (embracethered.com).

Step through the EchoLeak chain again with that inventory in mind. The hostile email is, in OWASP's taxonomy, an indirect prompt injection — LLM01:2025, which OWASP kept at the number-one position for the second consecutive edition and pointedly noted is not solved by RAG or fine-tuning, which “merely ground the model, they do not secure it” (OWASP LLM01:2025). The agent's decision to act on the injected instruction crosses an identity boundary — it executed instructions that were never authorised by the user whose session it inherited. The exfiltration call is a tool-use failure, and the silent return to the user is a transparency failure. Four distinct categories; one incident; zero clicks.

The legal and reputational tail of these incidents is now established. In Moffatt v. Air Canada (February 2024), the BC Civil Resolution Tribunal rejected the airline's argument that its chatbot was a “separate legal entity” and awarded damages for negligent misrepresentation — the first common-law ruling that a company is liable for what its agent tells customers (case summary). Samsung Semiconductor banned generative AI tools company-wide in 2023 after three confidential-data disclosures in twenty days. NYC's MyCity chatbot was caught telling employers they could take workers' tips and landlords they could refuse housing-voucher tenants — both illegal. None of those would have shown up in a single-probe scan.

The case for specialist coverage

The Cloud Security Alliance's May 2025 Agentic AI Red Teaming Guide, developed with input from more than fifty contributors, organises the agentic threat surface into twelve categories: agent authorization and control hijacking, checker-out-of-the-loop, agent critical system interaction, goal and instruction manipulation, agent hallucination exploitation, agent impact chain and blast radius, agent knowledge-base poisoning, agent memory and context manipulation, multi-agent exploitation, resource and service exhaustion, supply-chain and dependency attacks, and agent untraceability (CSA, May 2025).

The thing to notice is not the count but the shape. Half of those categories are relational — they describe a behaviour that exists only between primitives. Multi-agent exploitation is not a property of any one agent. Impact chain and blast radius is not a property of any one tool. Memory and context manipulation cuts across retrieval, identity and output. A flat probe library can include a probe for each, but the probes do not compose with each other, because the harness has no shared state to compose them through.

The academic literature has been converging on the same conclusion. Zou et al.'s 2023 GCG paper (arXiv:2307.15043) and the AdvBench benchmark proved that universal adversarial suffixes transfer across aligned models; the follow-on work on multi-agent adversarial planners — the RedAgent line — demonstrated that coordinated specialists discover compromise chains that no single specialist can discover, because the chain requires a hypothesis formed by one agent to be available as a prior to another (Zou et al., 2023). That is the operational case for specialisation: not because the taxonomy has twelve or fourteen rows, but because the failure mode is intersectional and the harness has to model the intersections explicitly.

Coordination problems no one wants to talk about

The honest answer about multi-specialist red teaming is that the hard problems are not the attacks — they are the coordination overhead the design imposes on the team running it. Three problems are worth naming because they kill more multi-specialist programmes than threat-model gaps do.

Deduplication. Specialists running independently will surface the same underlying defect from three angles. An indirect-prompt-injection probe in a goal-hijack specialist, a content-bypass probe in a detection-evasion specialist, and a memory-poisoning probe in a knowledge-base specialist can all fire on the same hostile sentence in the same retrieved document. Without deduplication, the report becomes noise. With naive deduplication keyed on payload text, real chained findings collapse into a single low-severity entry. The right answer — payload-hash plus observation-type plus framework-boundary keying — is more engineering than the marketing literature suggests.

Prioritisation. A swarm produces dozens of findings on a routine scan; only a handful are worth a developer's next hour. AIVSS-style scoring — severity times tier weight, with chained findings carrying a multiplier — is the operationally honest answer, but it has to be calibrated against the deployment tier (a T1 agent with PII access and writable tools scores differently from a T4 prompt-only chatbot) and the scoring has to be deterministic if it is going to be used as a CI gate. Most published scoring schemes are not.

Evidence consolidation. Auditors and regulators are not interested in JSONL probe output. They want a per-finding artefact that names the standard it maps to, includes a reproducer, and carries enough context for an external reviewer to confirm the finding without re-running the attack. MITRE ATLAS v5.1.0, published November 2025, now contains 16 tactics, 84 techniques and 56 sub-techniques, including 14 new agent-specific techniques contributed by Zenity Labs (atlas.mitre.org). NIST AI 600-1, the Generative AI Profile of the AI RMF, catalogues more than 400 mitigation actions across the Govern/Map/Measure/Manage functions (NIST AI 600-1). A multi-specialist harness that does not map every finding to ATLAS techniques and NIST GAI risks is producing telemetry, not evidence.

From scan output to board-level posture

Gartner's 2025 Hype Cycle for Artificial Intelligence moved AI Trust, Risk and Security Management (AI TRiSM) to the peak of inflated expectations, alongside AI agents and AI-ready data — while generative AI itself slid into the trough. Gartner's framing is unsubtle: “AI brings new trust, risk and security management challenges that conventional controls don't address,” and conventional controls includes the LLM scanners that ran two years ago (Gartner on AI TRiSM). IDC's FutureScape 2026 forecast pushes the same point operationally: by 2030, up to 20% of G1000 organisations will face lawsuits, fines or CIO dismissals tied to high-profile disruptions caused by poor agent governance (IDC FutureScape 2026).

The bridge from scan output to board-level posture is the framework crosswalk. Every finding a multi-specialist harness produces should carry three tags: the OWASP category (LLM Top 10 or the agentic ASI series, depending on the failure's shape), the MITRE ATLAS technique ID, and the NIST GAI risk it maps to. That is what makes the evidence pack defensible. The CSA Agentic Red Teaming Guide is the operational manual; OWASP is the developer-facing taxonomy; ATLAS is the threat-modelling vocabulary; NIST is the governance frame. A scan artefact that speaks all four is auditable. A scan artefact that speaks one is not.

This is also the place where the “what does good look like” conversation has to happen. The scoring rollup should be reproducible, the reproducer for each finding should be self-contained, and the harness itself should be inspectable — ideally open-source — so that a security engineer who needs to know why a finding fired can read the code that fired it. Anthropic's Responsible Scaling Policy v3.0 now commits to publishing Frontier Safety Roadmaps and Risk Reports for the same reason: opacity in the evaluation harness is the most common reason a third-party evaluation gets discounted by a regulator (Anthropic RSP v3.0).

Practical takeaway

If your current agent assurance programme is built around a single-probe scanner, here is the shortlist of questions worth running this week:

—Does the harness model identity, memory, tool use, output and inter-agent communication as distinct surfaces, or does it flatten them into one prompt-response loop?
—When two probes fire on the same hostile payload at different points in the agent's execution graph, does the harness recognise the chain, or does it report them independently?
—Are findings tagged against OWASP LLM Top 10, MITRE ATLAS techniques, and NIST AI 600-1 risk categories at authoring time, or is the mapping retrofitted in a spreadsheet?
—Is the scoring rollup deterministic enough to gate a pull request on, and is the reproducer for each finding self-contained enough for a developer to confirm the fix?
—If a new threat model lands next month (the way A2A signed-message replay landed in early 2026), where does the new probe go, and how long is it before it ships in the default scan?

The operational shape of multi-vector assurance is the same shape Gartner is calling AI TRiSM and CSA is calling Agentic Red Teaming: continuous, framework-mapped, specialist-driven, with evidence that survives independent review. AgentGuardian is one open-source implementation of that shape; the design notes, the probe corpus, the AIVSS scoring formula and the Apache-2.0 distribution are at /open-source, and the managed evidence-pack and policy-enforcement layer is at /enterprise. Either route is a way to operationalise the argument above — the more important thing is that whatever harness you run, you stop treating the unit of test as a single prompt, and start treating it as the system the agent actually is.

The case for multi-specialist agent red teaming.