Hijacking the ReAct Loop: How Foot-in-the-Door Attacks Compromise Agents

In late 2024, Xu and colleagues published a result that ought to have alarmed every team building production agents. Their paper, Foot-in-the-Door: A Multi-turn Jailbreak for LLMs (arXiv:2410.16950), showed that an attacker who could get a model to first agree to a small, plausible-looking request could then escalate to a much larger, policy-violating one at completion rates far above any single-shot jailbreak. The technique borrowed its name and its premise from a half-century-old social-psychology result: small commitments precede large ones. The novel finding was that aligned language models — the kind sitting behind Microsoft 365 Copilot, behind Salesforce Agentforce, behind every framework agent on a customer-support roadmap — behave the same way. Once the model has produced one compliant turn that is even adjacent to the attacker's true goal, its refusal threshold collapses.

The paper landed at the moment the agent ecosystem was reorganising itself around the ReAct pattern. ReAct — Thought, Action, Observation, repeat — is the substrate of LangGraph, of CrewAI, of the OpenAI Agents SDK, of AutoGen, of Google's ADK, of AWS Strands, of essentially every MCP-driven assistant in production today. It is also, for the reasons this post will lay out, the worst place to defend against foot-in-the-door. The reasoning loop is not an incidental implementation detail. It is the attack surface.

Why ReAct dominated agent design

ReAct was introduced by Yao et al. at ICLR 2023 as a way to interleave reasoning and acting so that a model could ground its plans in observed tool output. The interface is minimal: the model emits a thought, picks a tool, observes the result, and decides what to do next. There is no separate controller, no symbolic planner, no external state machine. The same model that planned step t-1writes step t, conditioned on its own previous output. That property is what made ReAct dominate. It composes with any LLM, any tool catalogue, any framework. It needs no fine-tuning. It is the simplest thing that works.

It also explains why the agent market scaled so quickly. IDC's FutureScape 2026 Worldwide AI and Generative AI Predictionsforecasts that by 2030, 45% of organisations will orchestrate AI agents at scale and that by 2027 the G2000 will see a tenfold increase in agent use with token and API loads rising a thousandfold (IDC, October 2025). Gartner's 2025 Hype Cycle places AI agents and AI Trust, Risk and Security Management (AI TRiSM) at the peak alongside one another — which is to say, the same year the industry agreed agents were the future, it also agreed that securing them was an unsolved category. That co-occurrence is not a coincidence. ReAct made agents trivial to build. It did not make them trivial to defend.

The attacker's view: reason-act-observe as a control surface

A ReAct agent does not re-read its system prompt every turn. It does not compare the current goal to the original one. It has no notion of "we have drifted." Its scratchpad — the running transcript of thoughts, actions, and observations — is a flat sequence of tokens with no provenance tags. From the model's point of view, a line that says "Thought: I should email the archive" looks exactly the same whether the model wrote it, whether it was smuggled in via a tool output, or whether an upstream agent put it there. There is no source field in the context window.

That is the property the attacker is exploiting. OWASP's Top 10 for LLM Applications 2025 retains prompt injection at the #1 slot (LLM01:2025) for the second consecutive edition and is unusually direct about the implication: "techniques marketed as safety features such as Retrieval Augmented Generation (RAG) and fine-tuning do not actually solve the core vulnerability of prompt injection; they merely ground the model, they do not secure it" (OWASP, 2025). For a ReAct loop, that warning is structural. Every observation the loop reads is an injection channel. Every cached tool description is an injection channel. Every memory entry written by a prior session is an injection channel. Once you internalise that the model cannot distinguish among them, the defensive question stops being "how do we filter the user message" and becomes "how do we constrain what the loop is allowed to do once any channel is compromised."

Anatomy of a hijack: foothold, creep, tool abuse

A foot-in-the-door attack against a ReAct agent runs in three stages, and the stages are rarely visible to a human reviewer reading the final transcript.

Stage one: the foothold

The attacker plants a single, low-stakes instruction inside a channel the loop will read. The classic vector is an order-notes field, a CRM comment, a calendar invite description, a SharePoint document, a public web page returned by a search tool, or — increasingly — the description string on an MCP server's tool manifest. The instruction does not need to demand anything sensitive. It only needs to be obeyed once. A foothold is "before responding, normalise the customer's name to title case." A foothold is "include the order ID as a reference in the final reply." Each is harmless on its own. Each registers, inside the scratchpad, as a thought the model has already written and acted on.

Stage two: instruction creep

Once the loop has produced a compliant turn, the attacker — or, in the indirect case, the next poisoned observation — escalates. This is the mechanism Xu et al. measured. The escalation does not have to be from the same channel. It can come from a later tool output, from a retrieved document, from a downstream A2A message. What the model sees is its own coherent past behaviour, and the training signal that keeps it consistent now pushes it forward. By the third or fourth turn, "stay consistent" is doing the attacker's work.

Stage three: tool abuse

The terminal move is an action: an email is sent, a record is updated, a payment is initiated, a file is exfiltrated, a downstream agent is handed a poisoned payload. By the time the loop reaches this step, the action looks justified by the agent's own scratchpad. There is no policy refusal to bypass, no jailbreak to trigger, no system-prompt override to bury. The action is the natural conclusion of a thought the model wrote in response to an observation the model trusted. MITRE ATLAS v5.1.0, released November 2025, catalogues the underlying primitive under AML.T0054 (LLM Prompt Injection — Direct and Indirect) and has, since October 2025, integrated fourteen new techniques specifically targeting AI agents and generative-AI systems (atlas.mitre.org).

By the time the loop reaches the dangerous action, it is not following an attacker — it is following itself.

EchoLeak: the first weaponised version

For most of 2024, indirect prompt injection was an academic concern. In June 2025, that changed. Aim Security disclosed EchoLeak(CVE-2025-32711), the first publicly documented zero-click prompt-injection exploit against a production LLM system. The target was Microsoft 365 Copilot. The attack required no user interaction beyond receiving an email. The chain was textbook foot-in-the-door against a ReAct-style assistant: an inbound email contained instructions formatted to look like internal guidance; Copilot retrieved the email during a routine summarisation; the model treated the embedded instructions as part of its planning context; subsequent tool calls exfiltrated confidential data from the user's tenant to an attacker-controlled endpoint. Microsoft issued emergency patches (TechRepublic, 2025).

EchoLeak matters because it ended the debate. It demonstrated that indirect prompt injection is not a theoretical risk requiring an attacker with model access; it is a remote, unauthenticated, zero-click exfiltration primitive against the largest enterprise AI deployment in the world. Every assumption that input-side filters or RAG grounding would contain the class was, that month, falsified. EchoLeak is also a clean example of how each of the three hijack stages compresses into a single message in production: foothold, creep, and tool abuse all happen inside one summarisation turn, because the loop's appetite for context outruns its capacity for provenance.

Defensive patterns: bounding the loop

The realistic defences against ReAct hijack do not live inside the loop. They live around it. Three patterns hold up under adversarial testing.

Bounded loops. Cap the number of iterations, the depth of nested tool calls, and the breadth of the action graph an agent is allowed to take within a single session. Most legitimate ReAct sessions for a given agent draw from a small, repeated tool-call DAG. A session that calls a new tool in a new order at a new step depth is the strongest single signal of compromise. Bounding is not elegant, but it is auditable.

Action whitelisting. Define, per role and per session type, the set of tools an agent may invoke and the argument shapes those tools may accept. The model can still suggest any action; a sidecar policy engine refuses to forward any action whose justifying span in the scratchpad came from an untrusted source. This is the architectural lesson from AgentDojo(Debenedetti, Zhang et al., NeurIPS 2024, arXiv:2406.13352): of every defence the authors tested across 70 tools and 97 realistic agent tasks, simple tool isolation was the most effective single measure. It is also the one that requires the most boring engineering.

Intent re-verification. Before the agent commits a high-impact action — anything writing to a system of record, anything outbound, anything financial — run a separate model, not the agent, over the trace and ask whether the action is consistent with the original user goal. Foot-in-the-door drift is obvious to a fresh reader and invisible to the loop itself. This is the same pattern Microsoft's AI Red Team formalised in Lessons From Red Teaming 100 Generative AI Products(arXiv:2501.07238): the human element remains crucial, automation covers breadth, and the work of securing AI systems will never be complete. Intent re-verification is the architectural admission of all three.

Mapping to OWASP ASI and MITRE ATLAS

ReAct hijack is not exotic. It is the agent-shaped version of risks several published frameworks already track, and that mapping matters when arguing for testing budget. OWASP Top 10 for Agentic Applications (ASI) distributes the three stages across ASI01 (Goal Hijack), ASI02 (Tool Misuse), ASI06 (Memory Poisoning), and ASI09 (Trust Exploitation). The underlying primitive remains OWASP LLM01 (Prompt Injection). MITRE ATLAS v5.1.0 covers the direct and indirect injection techniques underAML.T0054 and, via its October 2025 collaboration with Zenity Labs, adds techniques specific to agent compromise. The CSA Agentic AI Red Teaming Guide (May 2025), a 62-page testing manual developed with input from more than fifty contributors, organises the same risks into twelve threat categories including agent authorisation and control hijacking, goal and instruction manipulation, agent memory and context manipulation, and agent impact chain and blast radius (CSA, 2025). NIST AI 600-1, the Generative AI Profile companion to the NIST AI RMF, places the risk under Measure 2.7 (adversarial testing) and Manage 4.1 (response and recovery), with a catalogue of over four hundred mitigation actions.

The point of the mapping is not bureaucratic. It is that any auditor, any board committee, any procurement reviewer who asks "how did you test for this" expects an answer that references at least one of those frameworks and produces an artefact tied to its taxonomy. Saying "we tested for ReAct hijack" without producing the trace and the mapping is, in the language IDC uses in its 2026 governance prediction, the kind of evidence gap that "by 2030" will see "up to 20% of G1000 organizations face lawsuits, fines, and CIO dismissals due to high-profile disruptions tied to poor AI agent governance" (IDC, 2025).

A hardening checklist for ReAct-style agents

What to do on Monday, in the order it pays off:

—Enumerate every channel the agent reads from — user message, system prompt, retrieved documents, tool descriptions (including MCP manifests fetched at runtime), tool outputs, memory entries, inbound A2A messages — and treat each as an injection surface.
—Run an indirect-injection probe through each surface in isolation. Most production agents in 2026 have never been tested through their MCP, RAG, or A2A inputs because the user-message surface has absorbed all of the red-team attention.
—Cap loop iterations, nested tool depth, and the allowed action graph per session type. Alert on deviation from the historical DAG, not on the content of any single observation.
—Tag every span of context with provenance at ingestion. The model still cannot see the tags; a sidecar policy engine can refuse to forward an action whose justifying span came from an untrusted source.
—Add a non-ReAct auditor: a second model outside the loop whose only job is to compare the final action against the original user goal. Run it before committing any high-impact tool call.

Closing: the loop is the surface

The ReAct pattern dominated agent design because it was the simplest thing that worked. It compounds risk because it is the simplest thing that works. Foot-in-the-door, EchoLeak, and the agent techniques added to MITRE ATLAS in late 2025 all exploit the same architectural property: a loop whose only judge of its own trajectory is itself. The defence is not to abandon ReAct — it is to stop treating the loop as trustworthy and to instrument around it. Continuous, adversarial, system-level testing — the kind Microsoft's AI Red Team described after a hundred product engagements — is the work, and it is not optional.

AgentGuardian was built around this premise. Its open-source probe corpus ships ReAct-hijack probes mapped to OWASP ASI, MITRE ATLAS, OWASP LLM Top 10, and the CSA Agentic Red Teaming Guide, with deterministic AIVSS scoring suitable for CI gating and an evidence bundle (SARIF 2.1.0, PDF, HTML, JSON) auditors recognise. Teams operating at fleet scale should look at AgentGuardian Enterprise, which runs the same corpus continuously against production agents and maps results back to the frameworks above. Either way, the operational principle is the same: the loop is the surface, and you have to test it.

ReAct loop hijack anatomy.