Tiering Your AI Agent Estate: A Risk-Based Approach to Test Coverage

On August 2, 2026, the bulk of the obligations in the European Union's Artificial Intelligence Act for high-risk AI systems become enforceable: the risk-management, technical-documentation, and quality-management requirements of Articles 9 through 17, together with conformity assessment (Article 43), post-market monitoring (Article 72), and serious-incident reporting (Article 73) (artificialintelligenceact.eu — implementation timeline). The deadline arrives at a moment when enterprises are no longer running one or two agents in production. They are running dozens, and in some cases hundreds, across customer service, code generation, claims adjudication, internal search, RPA replacements, and analyst workbenches. The compliance, security, and AI risk functions inside those enterprises share a problem they have not yet admitted out loud: there is no realistic budget under which every agent can be tested to the same depth.

The pragmatic response is a tiering model — an inventory-level classification that decides, before any probe is selected or any assessor is hired, how deeply a given agent should be tested, how often, and against which threat catalogue. A T4 chatbot that summarises a public FAQ does not require the same treatment as a T1 agent that calls a payments API on behalf of a retail customer. Banks have done this for decades with model risk under the U.S. Federal Reserve's SR 11-7 guidance; AI governance teams now need an equivalent discipline for agents.

Four tiers, defined by what the agent can touch

The model that has held up across our reviews of enterprise agent estates is a four-tier one, defined not by sophistication or by model size but by reach — the union of data sensitivity, blast radius, autonomy, and audit obligation. Each tier is a structural property of the agent's wiring, recorded once in the inventory and reviewed when the wiring changes.

Tier	Profile	Reach	Worst-case effect
T1	Regulated / customer-impacting	Real-money tools, regulated data, externally-facing autonomy	Notifiable data event; statutory liability; regulator notification
T2	Customer-facing, non-regulated	Tools and memory; customer interaction without regulated data	Brand harm, contractual misstatement, mis-selling claim
T3	Internal, action-capable	Tools on internal systems; build pipelines, ITSM, dev productivity	Operational incident, regression shipped, supervised rollback
T4	Experimental / read-only	Sandboxed, content-only, no production tool reach	Content quality issue; no side effect beyond the response

A small number of structural questions place an agent in a tier without argument: does the agent call tools that move money, change a customer's account state, or alter a regulated record? Does it process or retrieve personal data under GDPR, APRA CPS 234, MAS TRMG, or equivalent? Is it customer-facing without a human reviewer in the loop? Does it persist memory across sessions? Any single "yes" to the first two pulls an agent into T1; a "yes" on customer-facing but non-regulated reach places it in T2; tool reach with internal-only blast radius lands T3; everything else is T4.

How tiers map to the EU AI Act, ISO 42001 and SR 11-7

The point of the tier is not vocabulary; it is leverage on the three frameworks an enterprise has to satisfy in parallel, with NIST's Generative AI Profile as a supporting reference. The EU AI Act categorises systems as prohibited, high-risk, limited-risk, or minimal-risk; the high-risk category includes systems used in credit scoring, employment, essential public services, law enforcement, and the other areas listed in Annex III. ISO/IEC 42001:2023, the first international management system standard for AI, requires an organisation to identify, assess, and treat AI risks via documented controls (iso.org — ISO/IEC 42001:2023). The Federal Reserve's SR 11-7 guidance on model risk management has been used by banks since 2011 to define proportional model validation (federalreserve.gov — SR 11-7). Alongside these, NIST's Generative AI Profile (NIST AI 600-1, July 2024) identifies 12 GAI-specific risks and more than 200 suggested actions mapped to GOVERN, MAP, MEASURE, and MANAGE (nist.gov — AI 600-1).

A clean four-tier model maps to all three with no contradictions. T1 agents are presumptively in scope for EU AI Act Annex III high-risk obligations, require an ISO/IEC 42001 AI system impact assessment under Annex B.5.2, and correspond to SR 11-7 high-materiality models with full annual independent validation. T2 agents are usually limited-risk under the AI Act (subject to transparency obligations under Article 50) but inherit the B.5.2 impact assessment and a moderate SR 11-7 validation tier on account of customer exposure. T3 and T4 fall into limited or minimal-risk under the AI Act and require a lighter-touch evidence pack, although T3, by virtue of acting on internal systems, still requires verification and validation testing under ISO/IEC 42001 B.6.2.4.
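As a rough illustration, the mapping in the paragraph above can be held as a single lookup table keyed by tier. The sketch below is a minimal one in Python; the `CROSSWALK` name, the dictionary keys, and the wording of each entry are assumptions made for illustration. The entries paraphrase this article's mapping and do not quote any of the frameworks.

```python
# Hypothetical crosswalk table: tier -> expectation under each framework.
# Entries paraphrase the mapping described in the article; they are not
# quotations from the EU AI Act, ISO/IEC 42001, or SR 11-7.
CROSSWALK = {
    "T1": {
        "eu_ai_act": "presumptively high-risk (Annex III)",
        "iso_42001": "impact assessment required",
        "sr_11_7": "high materiality; full annual independent validation",
    },
    "T2": {
        "eu_ai_act": "usually limited-risk; Article 50 transparency",
        "iso_42001": "impact assessment required",
        "sr_11_7": "medium materiality; lighter independent review",
    },
    "T3": {
        "eu_ai_act": "limited or minimal risk",
        "iso_42001": "verification testing required",
        "sr_11_7": "medium materiality; lighter independent review",
    },
    "T4": {
        "eu_ai_act": "limited or minimal risk",
        "iso_42001": "lighter-touch evidence pack",
        "sr_11_7": "low materiality; inventory tracking and recertification",
    },
}


def obligations(tier: str) -> dict[str, str]:
    """Return the framework expectations recorded for a tier."""
    try:
        return CROSSWALK[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None


if __name__ == "__main__":
    for framework, expectation in obligations("T2").items():
        print(f"{framework}: {expectation}")
```

Keeping the mapping in one table means a change to a framework reading is made once and picked up by every report that cites it.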

Test cadence and coverage per tier

With the tier assigned, the test plan becomes a derivation rather than a debate. Microsoft's AI Red Team, after testing more than 100 generative AI products, reframed red teaming as a continuous, system-level practice rather than a single point-in-time engagement (arXiv:2501.07238 — Lessons from Red Teaming 100 GenAI Products). That reframing only scales if the cadence is matched to risk. Running the full OWASP LLM Top 10 corpus, MITRE ATLAS techniques, and the CSA Agentic AI Red Teaming Guide's twelve threat categories against every agent every sprint is neither economically nor organisationally feasible. The cadence below is the one we have seen survive contact with both engineering teams and second-line risk:

—T1: full OWASP LLM Top 10 + MITRE ATLAS + CSA Agentic Red Teaming corpus on every release candidate; continuous runtime monitoring; quarterly external adversarial assessment; annual independent validation report; AIVSS gate at every deploy.
—T2: OWASP LLM Top 10 + indirect prompt injection corpus on every release; monthly automated regression; runtime monitoring of customer-facing interactions; SARIF evidence retained for two release cycles.
—T3: prompt-injection, tool-misuse, and agent-to-agent probes on a release cadence; quarterly broader scan; runtime logging without policy enforcement; evidence retained for the next audit window.
—T4: smoke-test corpus on first release and whenever the system prompt or model changes; annual refresh; informal evidence retention.
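The cadence list above lends itself to a small configuration table from which a pipeline could derive what to run. The following is a sketch under stated assumptions: the `Cadence` dataclass, the corpus identifiers such as `"owasp-llm-top10"`, and the `corpora_due` helper are all hypothetical names, and the periodic and runtime entries simply restate the bullets as text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cadence:
    on_release: tuple[str, ...]   # corpora run at a release event
    periodic: tuple[str, ...]     # scheduled activities, as free text
    runtime: Optional[str]        # runtime posture; None if the list names none


# Hypothetical identifiers; the content mirrors the per-tier bullets.
CADENCE = {
    "T1": Cadence(
        ("owasp-llm-top10", "mitre-atlas", "csa-agentic-red-teaming"),
        ("quarterly external adversarial assessment",
         "annual independent validation report"),
        "continuous runtime monitoring",
    ),
    "T2": Cadence(
        ("owasp-llm-top10", "indirect-prompt-injection"),
        ("monthly automated regression",),
        "monitoring of customer-facing interactions",
    ),
    "T3": Cadence(
        ("prompt-injection", "tool-misuse", "agent-to-agent"),
        ("quarterly broader scan",),
        "logging without policy enforcement",
    ),
    "T4": Cadence(("smoke-test",), ("annual refresh",), None),
}


def corpora_due(tier: str, *, first_release: bool = False,
                prompt_or_model_changed: bool = False) -> tuple[str, ...]:
    """Corpora to run at a release event for an agent of the given tier."""
    cadence = CADENCE[tier]
    # T4 is only re-tested on first release or when the system prompt
    # or model changes; T1 to T3 are tested on every release.
    if tier == "T4" and not (first_release or prompt_or_model_changed):
        return ()
    return cadence.on_release


if __name__ == "__main__":
    print(corpora_due("T1"))
    print(corpora_due("T4", prompt_or_model_changed=True))
```

The asymmetry is visible in the data: only the T4 entry is conditional on what changed, while the upper tiers run their corpora unconditionally at release.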

The cadence is asymmetric on purpose. The marginal probe budget is concentrated where the marginal harm is — which is also where the EU AI Act's technical documentation, post-market monitoring, and incident reporting obligations bite hardest.

A worked example: how Air Canada's chatbot would tier

In February 2024, the British Columbia Civil Resolution Tribunal ruled that Air Canada was liable for negligent misrepresentation after its customer-service chatbot incorrectly told passenger Jake Moffatt that he could claim a bereavement-fare refund retroactively. The tribunal rejected Air Canada's argument that the chatbot was a "separate legal entity" and ordered the airline to pay damages covering the fare difference, plus interest and tribunal fees (American Bar Association — Moffatt v. Air Canada). The amount was small. The precedent was not.

Under the four-tier model, this chatbot would have been a clear T2: customer-facing, externally exposed, capable of binding the company to policy assertions, but not directly invoking a payments tool or processing regulated personal data beyond what a booking already exposed. The required cadence at T2 would have included an indirect-prompt-injection corpus run against the chatbot's grounding documents, an output-fidelity test against the airline's actual fare policy document, and a grounded-citations regression on every prompt change. None of these are exotic, and the latter two are exactly the kind of check that would likely have surfaced the misstatement the tribunal found dispositive. A T4 classification ("it's just a chatbot") would have been the implicit, unwritten decision that produced the liability. The tiering exercise forces that decision into the inventory where it can be challenged.

The point of the tier is to make the decision "this is just a chatbot" legible — and therefore challengeable — before the tribunal makes it for you.

Crosswalk to SR 11-7 model risk tiering

Financial-services CISOs reading this will recognise the structure. SR 11-7 tiers models by materiality — broadly, by the magnitude of the decisions a model influences and the financial statement line items it touches — and prescribes proportional validation: independent challenger models, ongoing performance monitoring, and annual reviews for high-materiality models; lighter touch for low-materiality ones. The four-tier agent model is the natural extension: T1 agents correspond to high-materiality SR 11-7 models and inherit the same validation expectations; T2 and T3 correspond to medium-materiality models with lighter independent review; T4 corresponds to low-materiality models subject to inventory tracking and periodic recertification.

The crosswalk lets a bank treat agentic AI under existing model risk machinery rather than constructing a parallel framework. That is not a small convenience: SR 11-7 has fifteen years of board-level habituation behind it, and the AI Act's requirements for human oversight (Article 14), accuracy (Article 15), and risk management (Article 9) sit much more comfortably inside SR 11-7's validation taxonomy than inside a new AI-only one. The same logic applies in non-bank enterprises through ISO/IEC 42001's AI management system once it is in place.

A tiering questionnaire for an agent estate

The tiering exercise is short. The questionnaire we hand to platform teams takes about ten minutes per agent and answers a small set of structural questions whose union resolves to a tier:

—Does this agent call a tool that moves money, changes a customer-account state, or alters a regulated record? Yes → T1.
—Does this agent process or retrieve personal data under GDPR, APRA CPS 234, MAS TRMG, RBI FREE-AI, or equivalent? Yes → T1 (or T2 if read-only and ephemeral).
—Is this agent customer-facing without a human reviewer on every response? Yes → at least T2.
—Does this agent invoke tools on internal systems (CI/CD, ITSM, observability, source repos)? Yes → at least T3.
—Does this agent persist memory across sessions, including vector stores and RAG corpora the agent itself writes to? Yes → escalate one tier above the result of the other questions (T1 stays T1).
—Is the agent subject to EU AI Act Annex III, an MAS notice, APRA prudential standard, or sector-specific AI guidance? Yes → T1, with documented impact assessment.
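The questionnaire above reduces to a short function: take the most severe tier any answer implies, then apply the memory escalation. The sketch below is one possible reading, not a definitive implementation. The `Answers` fields and `assign_tier` are hypothetical names, and it assumes the memory question escalates the tier derived from the other five questions and is capped at T1.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    moves_money_or_alters_regulated_record: bool = False
    handles_personal_data: bool = False
    personal_data_read_only_and_ephemeral: bool = False
    customer_facing_unreviewed: bool = False
    invokes_internal_tools: bool = False
    persists_memory: bool = False
    under_sector_regulation: bool = False


def assign_tier(a: Answers) -> str:
    """Resolve questionnaire answers to a tier label, T1 (highest) to T4."""
    tier = 4  # default: experimental / read-only
    if a.invokes_internal_tools:
        tier = min(tier, 3)
    if a.customer_facing_unreviewed:
        tier = min(tier, 2)
    if a.handles_personal_data:
        # Read-only, ephemeral access is treated as T2 rather than T1.
        tier = min(tier, 2 if a.personal_data_read_only_and_ephemeral else 1)
    if a.moves_money_or_alters_regulated_record or a.under_sector_regulation:
        tier = 1
    if a.persists_memory:
        # Assumption: escalate one tier, never beyond T1.
        tier = max(1, tier - 1)
    return f"T{tier}"


if __name__ == "__main__":
    # A customer-facing chatbot with no tools, memory, or regulated data.
    print(assign_tier(Answers(customer_facing_unreviewed=True)))
    # An internal coding agent that also writes to its own vector store.
    print(assign_tier(Answers(invokes_internal_tools=True, persists_memory=True)))
```

Under these assumptions the first call resolves to T2, matching the Air Canada worked example, and the second resolves to T2 because persistent memory lifts a T3 agent by one tier.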

The questionnaire is not the end of the process — it is the input to the inventory. The inventory is what the auditor reads. An enterprise that can produce, in one query, the list of every agent in production by tier, the most recent assessment date per agent, the AIVSS posture score, and the responsible owner has already absorbed most of the documentation burden of EU AI Act Article 11, ISO/IEC 42001 clause 7.5, and SR 11-7's model-inventory expectation. The tier is the join key.
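To make the "one query" claim concrete, here is a minimal sketch of such an inventory using Python's built-in `sqlite3` module. The table layout, column names, agent names, dates, and scores are all invented for illustration. In particular, `posture_score` is only a placeholder for an AIVSS-style value, not a statement of how AIVSS is calculated.

```python
import sqlite3

# Hypothetical schema: one row per production agent, keyed by name.
SCHEMA = """
CREATE TABLE agents (
    name          TEXT PRIMARY KEY,
    tier          INTEGER NOT NULL CHECK (tier BETWEEN 1 AND 4),
    owner         TEXT NOT NULL,
    last_assessed TEXT,   -- ISO 8601 date, NULL if never assessed
    posture_score REAL    -- placeholder for an AIVSS-style score
)
"""


def build_demo_inventory() -> sqlite3.Connection:
    """Create an in-memory inventory with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO agents VALUES (?, ?, ?, ?, ?)",
        [
            ("payments-assistant", 1, "retail-banking", "2026-01-15", 7.2),
            ("faq-summariser", 4, "marketing", None, None),
            ("ci-triage-bot", 3, "platform-eng", "2025-11-03", 4.1),
        ],
    )
    return conn


def inventory_by_tier(conn: sqlite3.Connection) -> list[tuple]:
    """The single query: every agent, ordered by tier and then by name."""
    return conn.execute(
        "SELECT tier, name, owner, last_assessed, posture_score "
        "FROM agents ORDER BY tier, name"
    ).fetchall()


if __name__ == "__main__":
    for row in inventory_by_tier(build_demo_inventory()):
        print(row)
```

The `CHECK` constraint keeps untiered agents out of the table, which is the property an auditor relies on when treating the tier as the join key.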

Tiers are not static

One closing point about discipline. A tier is a structural decision about an agent at a moment in time. Agents change. A T3 internal coding agent that gains the ability to file pull requests to customer-facing repositories has, by that change, moved to T2 — and the change should automatically trigger the higher cadence. The inventory needs a tier-review trigger on every meaningful change to scope, tools, memory, or data access. Without that, the tier becomes a stale label and the tiering exercise becomes theatre. The simplest enforcement is to bind tier-change reviews to the same change-management process that already gates production deployment; the worst time to discover an agent has graduated to T1 is during an incident.
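A tier-review trigger of this kind can be sketched as a diff over the agent's recorded wiring. The example below is illustrative only: the `Wiring` dataclass, its four fields (taken from the scope, tools, memory, and data access dimensions named above), and the tool identifiers are hypothetical. A real deployment would hook this check into whatever change-management gate already exists.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Wiring:
    """Snapshot of what an agent can reach, as recorded in the inventory."""
    scope: frozenset = frozenset()
    tools: frozenset = frozenset()
    memory: frozenset = frozenset()
    data_access: frozenset = frozenset()


def changed_dimensions(before: Wiring, after: Wiring) -> list[str]:
    """Names of the wiring dimensions that differ between two snapshots."""
    return [
        f.name
        for f in fields(Wiring)
        if getattr(before, f.name) != getattr(after, f.name)
    ]


def needs_tier_review(before: Wiring, after: Wiring) -> bool:
    """Any change to scope, tools, memory, or data access triggers a review."""
    return bool(changed_dimensions(before, after))


if __name__ == "__main__":
    # Hypothetical internal coding agent gaining reach into a
    # customer-facing repository.
    before = Wiring(scope=frozenset({"internal"}),
                    tools=frozenset({"run_tests"}))
    after = Wiring(scope=frozenset({"internal"}),
                   tools=frozenset({"run_tests", "open_pr:customer-facing-repo"}))
    print(changed_dimensions(before, after), needs_tier_review(before, after))
```

The check deliberately flags every change rather than trying to judge which ones matter; deciding whether the tier actually moves stays with the reviewer.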

Practical takeaway

—Build an agent inventory before building any new red-team capability. Every agent in production gets a tier, an owner, a most-recent-assessment date, and a posture score.
—Use four tiers, defined by reach (data sensitivity, blast radius, autonomy, audit obligation). Resist the urge to invent more — three is too coarse, six is unmanageable.
—Treat the tier as the join key across EU AI Act classification, ISO/IEC 42001 controls, NIST AI 600-1 MEASURE/MANAGE actions, and SR 11-7 materiality. Map once; cite many times.
—Make tier review a mandatory step in the change-management process for any agent whose tools, memory, or data access changes — not an annual exercise.
—Concentrate the marginal probe budget on T1 and T2. Continuous coverage at the top of the estate beats uniform-but-shallow coverage everywhere.

Operationalising this

The four-tier model is a methodology, not a tool — any organisation can adopt it on a whiteboard. Most stop there because the operational lift of running tier-appropriate probes, generating SARIF evidence, and feeding the inventory is where the model collapses in practice. AgentGuardian Open Source is the workbench we ship to close that gap: an agent inventory with tier tagging, the OWASP LLM Top 10 / MITRE ATLAS / CSA Agentic Red Teaming corpora wired to tier-aware probe selection, AIVSS scoring, and SARIF output that feeds existing audit pipelines. The Apache-2.0 build is available at /open-source for teams that want to run it locally; the SaaS at /enterprise adds continuous monitoring, the NIST AI 600-1 and ISO/IEC 42001 crosswalks, and the regulator-facing rollup that turns the tiered inventory into a defensible posture statement.
