The Open-Source Agent Red-Team Stack: PyRIT, Garak, Promptfoo, AgentDojo

In November 2024, NVIDIA quietly took over stewardship of Garak — the open-source "nmap for LLMs" originally written by Leon Derczynski at ITU Copenhagen. Sixteen months later, in March 2026, OpenAI acquired Promptfoo, the MIT-licensed evaluation framework already in use at more than 127 Fortune 500 companies. In the interval, Microsoft published Lessons from Red Teaming 100 Generative AI Products; ETH Zürich's SPY Lab released AgentDojo at NeurIPS; the Cloud Security Alliance shipped a 62-page agentic red-teaming guide; and MITRE ATLAS — modelled after ATT&CK — grew to 16 tactics and 84 techniques, with 14 of those techniques added in late 2025 specifically to cover AI agents and generative systems.

The open-source LLM security stack has moved from research project to industrial supply chain. The capability map matters more than ever — because the tools are no longer interchangeable, the consolidation has reshaped who owns the roadmaps, and the gap between what these tools test and what production agents actually do has widened. This is a working map for engineers and risk owners picking, combining, or replacing tools in 2026.

Every tool on this map was built for a slightly different unit of test. The hard part is not picking one — it is knowing which threats fall through the seams when you stitch several together.

PyRIT: adaptive prompt probes, and where they stop

Microsoft's Python Risk Identification Toolkit shipped in February 2024 as the first widely-used open-source LLM red-team framework, drawn from the Microsoft AI Red Team's hands-on work on 100+ generative products, including Copilots and the Phi-3 release. PyRIT is built around an Orchestratorprimitive: a sequencer that owns a target, a list of seed prompts, an optional converter that mutates payloads (Base64, ROT13, persona-wrapping), and a scorer that classifies responses. Microsoft positions PyRIT explicitly as augmentation rather than replacement for manual red teaming, and the framework supports thousands of malicious prompts per category with an adaptive loop that changes tactics based on target responses (github.com/Azure/PyRIT).

PyRIT's strength is the orchestration substrate — pluggable targets, pluggable converters, pluggable scorers, and a SQLAlchemy-backed conversation memory that keeps every probe-and-response round reproducible. Its weakness is the shape of the unit of test. The orchestrator sees a single endpoint, fires single-turn probes, and judges responses in the same loop. Multi-turn chains where turn 4 depends on a memory write at turn 3 require a hand-rolled orchestrator subclass per chain. Agent shape — whether the URL is a chat completion, a LangGraph supervisor, or an MCP-fronted RAG agent — is not visible to PyRIT. And the scorer co-resides with the attack loop, which means the same memory and prompt surface are available to both attacker and judge.

Microsoft's own retrospective is candid about these limits: "the work of securing AI systems will never be complete," one of eight lessons distilled from the AI Red Team's product assessments (arXiv:2501.07238). PyRIT belongs in the stack — it remains the cleanest reference for the orchestrator pattern — but treating it as a complete agent-security solution is a category mistake.

Garak: "nmap for LLMs," and what the probe taxonomy buys

NVIDIA Garak — Apache-2.0, originally released by Leon Derczynski in June 2023 and transferred to NVIDIA in November 2024 — took a different cut. Where PyRIT's value lives in the orchestrator, Garak's lives in the probe catalogue. A Garak run is structured as probes × detectors × generators: each probe class produces a family of payloads (encoding attacks, glitch tokens, leakage probes, jailbreak templates), each detector classifies the response, and each generator is a thin wrapper around a target API. Fujitsu Research's independent 2024 review ranked Garak the leading LLM vulnerability scanner (garak.ai).

The probe catalogue is genuinely powerful for chatbot evaluation. The 100-plus probes that ship with Garak cover a lot of the surface in OWASP's Top 10 for LLM Applications and parts of MITRE ATLAS, and the framework makes it cheap to add new probes — one Python class, orchestration for free. Garak also explicitly separates detectors from generators, which is a small but real step toward structural judge-injection resistance compared to PyRIT.

Garak hits its ceiling at the agent threat model. A probe takes a target and returns a list of payloads; the framework calls the target once per payload and asks a detector to classify the output. There is no notion of intermediate state, no shared memory between probes, no mechanism for one probe's observation to inform another's payload. That is sufficient for testing whether a model refuses to write malware. It is insufficient for testing whether a LangGraph agent can be coerced into invoking send_emailwith arguments derived from a different tenant's retrieval index — a chain that real attackers compose every day and that 2025's EchoLeak (CVE-2025-32711) zero-click exploit against Microsoft 365 Copilot demonstrated at scale.

Promptfoo: CI-friendly evaluations and the OWASP mapping

Promptfoo redteam — MIT-licensed, 13.2k GitHub stars, used by 300,000+ developers and 127 Fortune 500 companies — made the move that turned LLM red-teaming from a research exercise into a CI gate. The framework ships declarative YAML configs that fit a promptfoo eval invocation the same way a unit-test runner does, supports more than 50 vulnerability types, and is explicitly mapped to OWASP's Top 10 for LLM Applications and the NIST AI Risk Management Framework Generative AI Profile (promptfoo.dev/docs/red-team). OpenAI's March 2026 acquisition kept the project open-source and intensified the OWASP mapping work, which matters when OWASP still places prompt injection at LLM01:2025 — the #1 risk for the second consecutive edition, with the explicit caveat that "techniques marketed as safety features such as Retrieval Augmented Generation and fine-tuning do not actually solve the core vulnerability of prompt injection" (OWASP LLM01:2025).

Promptfoo's contribution to the stack is procedural rather than architectural. It proved that taxonomy-first reporting and CI-shaped invocation matter as much as the probe library itself. The architectural unit of test is still the model — a chat completion behind an endpoint — and the framework treats the agent's tools, memory, and inter-agent protocol as opaque. That is a fine fit for evaluating model releases and prompt-template changes in a pipeline; it is a poor fit for evaluating an agent that calls fifteen tools and writes to a shared vector store. Knowing which question you are answering is the prerequisite for picking the tool.

AgentDojo: the first reproducible agent-security benchmark

AgentDojo, published by Debenedetti, Zhang et al. at ETH Zürich's SPY Lab and accepted at NeurIPS 2024, is the first rigorous, reproducible methodology for measuring LLM-agent security. The environment ships 70 tools, 97 realistic user tasks, and 27 injection targets across simulated email, banking and communication platforms. Headline findings from the paper are uncomfortable: current frontier LLMs solve fewer than 66 percent of even benign agent tasks; more capable models are often easier to attack, not harder; and simple tool isolation outperforms most prompt-side defences (arXiv:2406.13352). The benchmark is now used by both the US AI Safety Institute and the UK AISI in their published evaluations, including the disclosure of Claude 3.5 Sonnet's vulnerability to indirect prompt injection.

AgentDojo is not a scanner. It is a benchmark — a frozen environment that lets you compare agents and defences apples-to-apples. That is exactly what the rest of the stack lacks. Garak, PyRIT and Promptfoo each measure their own findings against their own probe libraries, which means a 92-percent pass rate from one tool is not commensurable with a 92-percent pass rate from another. AgentDojo gives the field a fixed point of reference against which any tool — open-source or commercial — can be calibrated. The price is that AgentDojo's environment is simulated; it cannot replace probes fired against your actual production agent.

The long tail: DeepTeam, Inspect, and what they cover

Beyond the four marquee tools, the open-source stack now includes DeepTeam (Confident AI / DeepEval team, Apache-2.0) for structured OWASP LLM Top 10 reporting; the UK AISI's Inspect framework for evaluation infrastructure including agentic tasks; the academic GCG attack and AdvBench benchmark from Zou et al. for gradient-guided universal adversarial suffixes that transfer across closed models (arXiv:2307.15043); and Cloud Security Alliance's 62-page Agentic AI Red Teaming Guide organising risks into 12 threat categories — agent authorisation hijacking, checker-out-of-the-loop, memory and context manipulation, multi-agent exploitation, supply-chain attacks, agent untraceability, and others (cloudsecurityalliance.org).

The pattern in the long tail is convergence on shared taxonomies — OWASP LLM Top 10, MITRE ATLAS, NIST AI 600-1, the CSA guide — and divergence on the testing primitive itself. That divergence is where the seams open.

The gaps: multi-agent, memory poisoning, MCP, A2A trust

The unfortunate truth about the 2026 open-source stack is that the threats most likely to land in production are the ones the stack covers least well. Four observable gaps:

—Multi-agent compromise. No widely-used open-source scanner models A2A (agent-to-agent) trust as a first-class concept. Real LangGraph and CrewAI deployments include supervisor-worker patterns where a hijacked supervisor signs payloads the worker accepts. Probes that test the supervisor and the worker in isolation produce green checks; the compromised chain ships.
—Memory poisoning that survives the session. Embrace The Red's August 2025 "Month of AI Bugs" series documented persistent memory-poisoning attacks against Amazon Bedrock agents that survive across sessions. Garak and PyRIT can fire memory-write probes, but neither has the abstraction to observe a payload written in one session and triggered ten sessions later.
—MCP tool description rug-pulls. Coding agents have already been fully compromised through Model Context Protocol tool descriptions — a tool registers innocuously, the LLM reads the description as instructions, and the supply-chain attack lands. No open-source scanner ships a probe class for adversarial tool descriptions as of mid-2026.
—Indirect prompt injection in the wild. NIST has called indirect prompt injection "generative AI's greatest security flaw," and the EchoLeak zero-click against Microsoft 365 Copilot showed weaponisation. Promptfoo and Garak both fire prompt-side injection probes; neither models the document-ingest, web-render, or email-render channels that real exploits use.

How mature teams are stitching the stack together

The pattern that has emerged in teams running serious agent-security programmes — at frontier labs, at large banks with deployed copilots, and in the FedRAMP-adjacent government contractors — is layered rather than single-tool. A representative stack looks like this:

—Promptfoo or DeepTeam as the CI gate. Declarative YAML, OWASP LLM Top 10 mapping, pipeline-shaped invocation. Runs on every pull request that touches a prompt template or model selection.
—Garak as the chatbot-tier scanner. Run against every model release and every routed-endpoint change, with the JSONL report archived as the per-release evidence artefact.
—PyRIT for bespoke adversarial campaigns. Used by the in-house red team to construct multi-turn chains that don't fit a declarative probe shape — judge-of-judge experiments, custom converters, long-tail leakage.
—AgentDojo as the calibration benchmark. Run quarterly against your deployed agent stack to produce a number commensurable with public reporting, including the US/UK AISI evaluations.
—A purpose-built agent scanner — open-source or commercial — for the threats that fall through the seams: A2A compromise, persistent memory poisoning, MCP tool-description supply chain, judge-injection against the evaluator itself.

Two observations on this stacking. First, the layering is real work — somebody on the security side has to own taxonomy reconciliation between OWASP LLM Top 10 and the new OWASP Agentic Top 10, between MITRE ATLAS v5.1 and the CSA threat categories, and between whatever numerical findings each tool emits. Second, Gartner's 2025 Hype Cycle places AI Trust, Risk and Security Management (AI TRiSM) at the peak of inflated expectations specifically because the layering problem has not yet been solved at the platform level. IDC's FutureScape 2026 prediction that "by 2030, up to 20 percent of G1000 organisations will face lawsuits, fines, and CIO dismissals due to high-profile disruptions tied to poor AI agent governance" is the cost of getting the layering wrong.

Practical takeaway: a starter stack for a small AI security team

—If you are evaluating model releases, start with Promptfoo. OWASP LLM Top 10 mapping out of the box, CI-shaped invocation, the strongest community at the moment, and the OpenAI acquisition has accelerated rather than slowed the roadmap.
—If you are stress-testing a chatbot or any single-model endpoint, add Garak. The probe catalogue is unmatched for that shape of target, and the NVIDIA stewardship has improved the release cadence.
—If you are running an internal red team building custom adversarial campaigns, keep PyRIT in the toolkit. It is the cleanest orchestrator substrate for one-off Python work, even where the project has gone quiet in 2026.
—If you ship an agent — tools, memory, planner, A2A — none of the above is sufficient on its own. Treat AgentDojo as your benchmark, the OWASP Agentic Top 10 (ASI01–ASI10) as your taxonomy, and budget for an agent-specific scanner.
—Run all of the above with a known calibration baseline. Microsoft's lesson #4 from the 100-product retrospective is that "automation can help cover more of the risk landscape," but lesson #5 is that "the human element is crucial." Tooling without a human in the loop produces false confidence.

Operationalising this

The open-source primitives above are necessary and insufficient. The architectural gap they leave — multi-agent compromise, persistent memory poisoning, MCP supply-chain attacks, judge-injection — is what agent-specific scanners are built to close. AgentGuardian is the Apache-2.0 distribution of that pattern: a fourteen-specialist adversarial swarm mapped to OWASP's Top 10 for Agentic Applications 2026 (ASI01–ASI10), MITRE ATLAS, and the CSA Agentic AI Red Teaming Guide, with a deterministic AIVSS score per finding. The open-source release is at /open-source and on PyPI as agent-guardian; the managed-governance version, including the ISO/IEC 42001 evidence pack and CI-gate observability, is at /enterprise. The ideal starting point is to run it against an agent you already control and compare the chain-level findings to whatever your current open-source stack produces.

The open-source agent red-team stack: a capability map.