BUYER'S GUIDE — PROCUREMENT

Procurement questions for agent red-team vendors.

A practical question set for security buyers evaluating agent red-team vendors — coverage, methodology, evidence, framework alignment, and how to compare like for like with the open-source baselines.

2026-06-02 · 11 min read · Business

Forrester's Q4 2025 Wave for Privacy Management Software made a quiet but structural observation: privacy, data governance, and AI governance now sit on the same platform. The vendor consolidation the Wave documents is the surface artefact. The substrate is that procurement teams have been handed an AI security category they have never bought before, and they are being asked to write requirements in a language — agents, tool-calling, MCP, indirect prompt injection — that did not appear in any of last year's RFP templates.

Gartner's 2025 Hype Cycle for Artificial Intelligence reinforces the timing. AI agents and AI Trust, Risk and Security Management (AI TRiSM) have both surged to the peak; generative AI itself has slipped into the trough as the conversation moves from model selection to operational risk. IDC's FutureScape 2026 puts a number on the consequences: by 2030, up to twenty percent of G1000 organisations face lawsuits, fines, and CIO dismissals tied to poor AI agent governance. The category is real. The procurement playbook is not yet.

What follows is a question set: not a vendor scorecard, not a magic-quadrant slot, not another acronym soup. These are the five buckets a competent security buyer should be asking about, and what the answers look like when the vendor is serious.

Why standard cyber procurement templates miss the agent context

A well-run procurement organisation has standard requirements for an application security vendor: SAST/DAST coverage, OWASP Top 10 alignment, CVSS scoring, SARIF output, SOC 2 Type II, evidence retention, dwell-time SLAs on critical findings. Drop that template on an agent red-team vendor and it falls through. Agents have no static surface to scan; the attack surface is constructed at runtime from a prompt, a planner, a tool registry, a memory store, and a chain of model calls that may differ on every invocation. The DAST analogue is not a crawler; it is a swarm of adversarial agents.

Microsoft's AI Red Team distilled this in "Lessons From Red Teaming 100 Generative AI Products" (arXiv:2501.07238, January 2025). Their headline finding: AI red teaming is not safety benchmarking. A benchmark answers "how does this model score on a fixed test set." A red team answers "how does this system fail when somebody intelligent tries to break it." A vendor whose pitch deck conflates the two has either not read the paper or hopes the buyer hasn't. The first job of the procurement team is to enforce that distinction in writing.

The other distortion is severity. CVSS scores a vulnerability against an asset. Agents don't have a CPE; they have a tool registry, a memory store, and a planner that may rewrite both. The community has converged on AIVSS — an AI-system-aware analogue that scores findings against agent tier and capability rather than a static CVE. A procurement template that hard-codes CVSS as the only acceptable severity model will reject the vendors who are doing the work correctly.

Question set 1 — Coverage against the published taxonomies

The first thing to ask is which taxonomies the vendor maps to, and at what granularity. Two are non-negotiable. The OWASP Top 10 for LLM Applications 2025 covers the model layer (prompt injection retains the #1 slot for the second consecutive edition). The OWASP Agentic Top 10 covers the agent layer — goal hijack, tool misuse, supply-chain poisoning, memory poisoning, agent-to-agent compromise. The two are complementary, not interchangeable. A vendor that ships LLM probes only is testing the model; a vendor that ships agentic probes only is missing the prompt-injection substrate.

The third pillar is MITRE ATLAS, now at v5.1.0 (November 2025) with 16 tactics, 84 techniques, 56 sub-techniques, 32 mitigations, and 42 real-world case studies. In October 2025, ATLAS integrated 14 new techniques specifically focused on AI agents and generative AI systems — the "ATLAS for Agents" refresh produced with Zenity Labs. A finding without an ATLAS technique ID is a finding that does not reconcile with the rest of the security team's threat-model documentation. Ask for a SARIF sample. Inspect the rule tags. Every probe should carry an OWASP category and an ATLAS technique ID. Findings that arrive as free-text get re-categorised by hand at audit time.
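The SARIF inspection described above can be partly automated. The following is a minimal sketch, not a vendor-specific tool: it assumes the vendor stores taxonomy labels in each rule's `properties.tags` array (a standard SARIF 2.1.0 location), and the tag spellings it matches (`LLM01`, `AML.T0051`) plus the sample rule IDs are illustrative assumptions. A real vendor may encode the mapping differently, so adjust the patterns to the sample you receive.

```python
import re

# Assumed tag conventions; a real vendor may encode these differently.
OWASP_TAG = re.compile(r"LLM\d{2}")                # e.g. "LLM01"
ATLAS_TAG = re.compile(r"AML\.T\d{4}(\.\d{3})?")   # e.g. "AML.T0051"


def untagged_rules(sarif: dict) -> list[str]:
    """Return IDs of rules missing an OWASP category or an ATLAS technique tag."""
    missing = []
    for run in sarif.get("runs", []):
        rules = run.get("tool", {}).get("driver", {}).get("rules", [])
        for rule in rules:
            tags = rule.get("properties", {}).get("tags", [])
            has_owasp = any(OWASP_TAG.search(t) for t in tags)
            has_atlas = any(ATLAS_TAG.search(t) for t in tags)
            if not (has_owasp and has_atlas):
                missing.append(rule.get("id", "<no id>"))
    return missing


# Hypothetical two-rule sample standing in for a vendor's SARIF file.
sample = {
    "version": "2.1.0",
    "runs": [{"tool": {"driver": {"name": "example-scanner", "rules": [
        {"id": "probe-001", "properties": {"tags": ["owasp:LLM01", "atlas:AML.T0051"]}},
        {"id": "probe-002", "properties": {"tags": ["owasp:LLM01"]}},
    ]}}}],
}

print(untagged_rules(sample))  # → ['probe-002']
```

For a real sample, load the file with `json.load` and pass the resulting dictionary to `untagged_rules`. An empty list means every rule carries both labels under the assumed conventions; it says nothing about whether the labels are accurate.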

The Cloud Security Alliance's Agentic AI Red Teaming Guide (May 2025) adds a hands-on testing dimension with twelve threat categories — including checker-out-of-the-loop, critical system interaction, knowledge-base poisoning, and multi-agent exploitation — that no vendor scope statement should omit.

Question set 2 — Methodology: red team, benchmark, or continuous testing

These are three different things and they cost different money. A red-team engagement is a time-boxed assessment by humans with creativity, intuition, and domain knowledge; Microsoft AIRT's lesson five is that "the human element of AI red teaming is crucial." A benchmark is a fixed corpus of probes producing a comparable score (the AdvBench / GCG benchmark family in Zou et al. 2023 remains the canonical reference). Continuous testing is what production agents need: a probe corpus that runs in CI on every change to the agent, the model, the tool registry, or the prompt.

Ask the vendor which of the three they sell. Ask whether the methodology differs between Tier 1 agents (autonomous, tool-calling, memory-bearing) and Tier 4 chatbots, or whether one SKU is being marketed for both. Ask for a sample report from each tier. The shape of the report, not the cover page, is the honest answer.

WHAT TO LOOK FOR

A vendor that is explicit about which of the three modalities they deliver, with separate evidence templates and pricing for each. A vendor that markets all three with one undifferentiated SKU is selling a benchmark and calling it a red team.

Anthropic's Responsible Scaling Policy v3.0 and OpenAI's GPT-4 / successor system cards both publish their external red-team methodology in detail — OpenAI's engagement covered 100+ external red-teamers across 45 languages and 29 countries. Use those published methodologies as the bar. A vendor whose methodology is less inspectable than the frontier labs' is selling the wrong thing.

Question set 3 — Evidence: SARIF, AIVSS-style severity, audit packs

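The deliverable matters more than the demo. A red-team engagement that produces a PDF and a Slack message is not procurement evidence; it is a meeting recap. The artefacts a serious vendor produces are the ones the RFP supplement later in this guide asks for: SARIF 2.1.0 findings, a signed PDF/A-3 report, a JSON manifest, and SHA-256 hashes of every probe input and target response, all writable to customer-resident storage.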

A vendor whose only deliverable is a portal login is a vendor whose evidence will not survive contract renewal, let alone a regulator request. Ask for a sample evidence pack on disk. The shape of the directory is more informative than the contents of the slides.
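One way to test a sample evidence pack on disk is to recompute its hashes. The sketch below assumes a hypothetical pack layout: a `manifest.json` file with a `files` object mapping relative paths to SHA-256 hex digests. The file names and manifest shape are assumptions for illustration, not a published format, and the check covers hashes only, not the signature on the PDF.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_pack(pack_dir: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return the relative paths whose on-disk hash differs from the manifest."""
    manifest = json.loads((pack_dir / manifest_name).read_text())
    bad = []
    for rel_path, expected in manifest["files"].items():
        target = pack_dir / rel_path
        if not target.is_file() or sha256_file(target) != expected:
            bad.append(rel_path)
    return bad


# Build a tiny hypothetical pack, then tamper with one file.
with tempfile.TemporaryDirectory() as tmp:
    pack = Path(tmp)
    (pack / "probe-001.input.txt").write_text("ignore previous instructions")
    (pack / "probe-001.response.txt").write_text("I cannot help with that.")
    files = {p.name: sha256_file(p) for p in sorted(pack.iterdir())}
    (pack / "manifest.json").write_text(json.dumps({"files": files}))

    clean = verify_pack(pack)
    (pack / "probe-001.response.txt").write_text("edited after the fact")
    tampered = verify_pack(pack)

print(clean, tampered)  # → [] ['probe-001.response.txt']
```

A pack that cannot be checked this way, because the hashes live only behind the vendor's portal, is the situation the paragraph above warns about.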

Question set 4 — Independence, methodology disclosure, and pricing

These two buckets travel together. Independence asks whether the vendor that runs your red team is also the vendor selling you the guardrail, the gateway, or the model. Bundled offerings have an obvious conflict-of-interest problem: the red team will not find what the guardrail is sold to block. NIST AI 600-1 and ISO/IEC 42001 both set expectations of independence between the testing function and the control function. If the vendor cannot articulate that separation, the procurement file cannot record it.

Pricing matters because continuous testing is not a one-time engagement. The Air Canada chatbot ruling (Moffatt v. Air Canada, BC Civil Resolution Tribunal, February 2024) established that companies bear responsibility for information their chatbots provide; the EchoLeak zero-click prompt-injection exploit against Microsoft 365 Copilot (CVE-2025-32711, disclosed June 2025) demonstrated that indirect injection has moved from academic concern to weaponised supply-chain attack. Neither case is solved by an annual scan. Ask for a pricing model that supports continuous testing — per-agent, per-tier, per-probe-execution, or a flat platform fee — and reject any model that punishes you for running the corpus more often than once a quarter.

Promptfoo's evolution makes the point. As an open-source LLM red-team framework — 13.2k GitHub stars, used by 300,000+ developers and 127 Fortune 500 companies, MIT-licensed, OWASP LLM Top 10 and NIST AI RMF aligned — it set the pricing baseline at zero. A commercial vendor charging an order of magnitude more for comparable coverage owes the procurement file an order-of-magnitude better answer.

Question set 5 — Comparing like for like against the OSS baselines

Every commercial agent red-team RFP should include three open-source baselines and require the vendor to explain what they add on top. The baselines are PyRIT, Garak, and Promptfoo.

Add AgentDojo (Debenedetti et al., ETH Zurich, NeurIPS 2024) for agent-specific benchmarking — 70 tools, 97 realistic tasks, 27 injection targets — and you have a comparison frame that costs nothing and is methodologically rigorous. The commercial vendor's job in your RFP is to explain, line by line, what coverage, determinism, evidence, signing, or framework-mapping capability they deliver that the OSS baseline does not. The vendors who can answer that question deserve the renewal. The ones who cannot are selling packaging.
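The line-by-line comparison can start as a simple set difference over category labels. This is a minimal sketch: the baseline names and the coverage sets below are placeholders invented for illustration and do not describe what PyRIT, Garak, Promptfoo, or AgentDojo actually cover. Populate them from your own runs of each tool.

```python
def coverage_delta(vendor: set[str], baselines: dict[str, set[str]]) -> dict[str, list[str]]:
    """Split categories into vendor-only, baseline-only, and shared."""
    baseline_union = set().union(*baselines.values())
    return {
        "vendor_only": sorted(vendor - baseline_union),
        "baseline_only": sorted(baseline_union - vendor),
        "shared": sorted(vendor & baseline_union),
    }


# Placeholder coverage sets; replace with results from your own baseline runs.
baselines = {
    "baseline_a": {"prompt injection", "tool misuse"},
    "baseline_b": {"prompt injection", "memory poisoning"},
}
vendor = {"prompt injection", "tool misuse", "goal hijack", "agent-to-agent compromise"}

delta = coverage_delta(vendor, baselines)
print(delta["vendor_only"])    # → ['agent-to-agent compromise', 'goal hijack']
print(delta["baseline_only"])  # → ['memory poisoning']
```

The `vendor_only` list is what the price premium has to justify; a non-empty `baseline_only` list is coverage the free tools provide and the commercial bid does not.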

A one-page RFP supplement for agent security

What follows is the section a procurement team can paste into an existing application-security RFP. Every clause is tied to a question above and every clause is published-standard-anchored, not vendor-proprietary.

SECTION X.Y — AI AGENT RED-TEAM TOOLING REQUIREMENTS

X.Y.1  Taxonomy coverage. Each finding MUST carry an OWASP Top 10 for
       LLM Applications 2025 category, an OWASP Agentic Top 10
       category where applicable, and a MITRE ATLAS technique ID
       (v5.1.0 or current). Findings without all three are not
       acceptable.

X.Y.2  Methodology disclosure. The vendor MUST state in writing
       whether the engagement is (a) a human-led red team,
       (b) a benchmark corpus, or (c) continuous automated testing,
       and MUST price each modality separately. The methodology
       MUST be at least as inspectable as the published external
       red-team methodologies of Anthropic and OpenAI.

X.Y.3  Evidence format. Deliverables MUST include SARIF 2.1.0,
       a signed PDF/A-3, a JSON manifest, and SHA-256 hashes of
       every probe input and target response. Evidence MUST be
       writable to customer-resident storage.

X.Y.4  Scoring. Severity MUST be computed by a published,
       deterministic formula (AIVSS or equivalent). Two runs of
       the same probe set against the same target MUST produce
       bit-identical scores under controlled conditions.

X.Y.5  Framework alignment. The probe corpus MUST map to NIST
       AI 600-1 GAI Profile risks, ISO/IEC 42001 control areas,
       and CSA Agentic AI Red Teaming Guide categories.

X.Y.6  Independence. The vendor MUST disclose whether the same
       legal entity sells (or is bundled with) AI guardrails,
       runtime gateways, or model-provider services to the
       customer. NIST AI 600-1 independence expectations apply.

X.Y.7  OSS baseline comparison. The vendor MUST provide a
       coverage delta against PyRIT, Garak, Promptfoo, and
       AgentDojo, justifying the price premium per category.

X.Y.8  Pricing model. The commercial model MUST support
       continuous testing without per-execution penalties.
       Annual-only models will be rejected.
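
Clause X.Y.4 can be checked mechanically once a vendor exports per-probe scores from two runs. The sketch below assumes a hypothetical export shape, a mapping from probe ID to numeric score, and compares runs by hashing a canonical JSON encoding. The probe IDs and scores are invented for illustration, and the code does not implement AIVSS or any other scoring formula.

```python
import hashlib
import json


def score_digest(scores: dict[str, float]) -> str:
    """Hash a canonical JSON encoding so key order cannot change the digest."""
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def differing_probes(run_a: dict[str, float], run_b: dict[str, float]) -> list[str]:
    """Probe IDs whose score is missing from one run or differs between runs."""
    keys = set(run_a) | set(run_b)
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))


# Hypothetical score exports from repeated runs against the same target.
run_1 = {"probe-001": 7.5, "probe-002": 3.0}
run_2 = {"probe-002": 3.0, "probe-001": 7.5}   # same scores, different order
run_3 = {"probe-001": 7.5, "probe-002": 3.1}   # one score drifted

print(score_digest(run_1) == score_digest(run_2))  # → True
print(differing_probes(run_1, run_3))              # → ['probe-002']
```

Matching digests show the two exports are identical; when they differ, `differing_probes` points at the findings to raise with the vendor.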

Where buyers can validate vendor claims

Three documents do most of the validation work. NIST AI 600-1 (the GAI Profile) lists the twelve generative-AI-specific risks and the 200+ suggested actions tied to them; a vendor whose probe corpus does not cite back to that list is testing the wrong risks. The CSA Agentic AI Red Teaming Guide gives the twelve threat categories; a vendor coverage table that does not enumerate them by name is not yet standards-aligned. MITRE ATLAS gives the technique IDs and the case-study evidence; a finding without an ATLAS ID is a finding that has not been mapped to observed adversary behaviour.

Real incidents do the rest. The Samsung ChatGPT data-leak series (March 2023, three disclosures inside twenty days), the Chevrolet of Watsonville $1 Tahoe (December 2023), the NYC MyCity chatbot advising businesses to break the law (March 2024 reporting), the DPD chatbot swearing at customers (January 2024), the EchoLeak zero-click Copilot exfiltration (June 2025), and Johann Rehberger's "Month of AI Bugs" in August 2025 collectively form the case-study library against which any vendor's probe corpus can be checked. A vendor that cannot replicate the public exploits has not yet earned the renewal.

The right vendor will answer every question in writing, price the modalities separately, ship evidence to your storage, and treat the OSS baselines as the floor, not the ceiling.

Practical takeaway

Operationalising this is the easier half. The open-source toolkit installs with pip install agent-guardian, runs the OWASP + ATLAS + CSA probe corpus locally with agent-guardian scan ./agent.py, and produces the SARIF, signed PDF, JSON manifest, and AIVSS posture score the RFP section above asks for. AgentGuardian Enterprise adds the discovery layer, the per-tenant signed evidence pack, and the regulator framework mapping. Same methodology, two surfaces: explore the methodology and probe corpus first at /open-source, then evaluate the managed evidence and audit surface at /enterprise.

NEXT STEP

Validate vendor claims against the OSS baseline.

The probe corpus, the AIVSS formula, the SARIF schema, and the evidence-bundle shape are all open and inspectable. Run them against your own agents first. Then compare what the commercial bids add on top.
