Vision-Based Prompt Injection: When the Attack Is Inside the Image

In August 2025, security researcher Johann Rehberger published a vulnerability a day for a month across the major commercial AI platforms. Among the case studies in his Month of AI Bugs series was a quiet, undramatic finding: production ad-review pipelines were being bypassed by injections hidden inside CSS — instructions the human reviewer never saw, but the language model behind the moderation queue dutifully read and acted on. No exploit kit, no zero-day, no novel cryptography. Just a stylesheet rule, an LLM, and a chain of trust that no one had thought to inspect.

A few months earlier, Aim Security disclosed EchoLeak (CVE-2025-32711), the first publicly documented zero-click prompt injection against Microsoft 365 Copilot. A specially crafted email — never opened by the victim — was enough to coerce Copilot into exfiltrating confidential data on the user's behalf. And before either of those, in February 2024, Jiang et al. published ArtPrompt, showing that words rendered as ASCII art could slip past every safety filter the commercial frontier labs had shipped, because the filter saw boxes and pipes while the model reconstructed the word and answered the otherwise-blocked question.

These three incidents do not share a CVE, a vendor, or a delivery mechanism. They share a single structural failure: the trust boundary in an AI system is no longer the user message. It is whatever the model can perceive. Pixels, characters, font sizes, audio spectrograms, document layout, EXIF metadata — every channel a multimodal model parses is now an injection channel, and most production defences were architected for an attack surface that no longer exists.

Why multimodal injection is a different problem

Text-only prompt injection has had three years of public scrutiny. The mitigations have not converged — OWASP's LLM01 entry in the 2025 Top 10 is explicit that techniques marketed as safety features, including retrieval-augmented generation and fine-tuning, "do not actually solve the core vulnerability of prompt injection; they merely ground the model, they do not secure it." OWASP retained prompt injection at the top of the list for the second consecutive edition, which is a polite way of saying the industry has not solved it. The text case is the easy case.

Multimodal injection is harder for two reasons. First, the payload does not look like a payload to the upstream control. A text filter reads a string; a multimodal filter has to evaluate pixels, layout, fonts, and metadata in a normalised representation, and the normalisation itself loses information that the model later recovers. Second, the human-in-the-loop intuition fails. The reviewer glancing at a JPEG sees the picture; the reviewer skimming a PDF sees the rendered page. Neither sees the alpha-channel text, the white-on-white footer, the steganographic perturbations in the colour histogram, or the off-screen CSS that the model's HTML renderer is happy to process. The defender's intuition about what is in the artefact diverges, sharply, from what the model perceives.

NIST's AI 600-1 Generative AI Profile calls this out under its catalogue of GAI-specific risks, naming prompt injection as a first-class concern and noting that mitigations have to span the MAP, MEASURE and MANAGE functions — not just the deployed model. That framing matters: multimodal injection is not a model bug that the next checkpoint will fix. It is a system-design property of any agent that accepts non-text input and downstream uses an LLM to make decisions.

The channels: images, PDFs, screenshots, HTML, audio

The taxonomy that holds up under contact has five active channels in 2026. Each requires a different mitigation, but they share a common pattern: a hostile instruction is embedded in a representation the upstream filter does not evaluate, then resolved into the model's working context as if it were a trusted system message.

1. Images — visible, invisible, and adversarial

The image-channel attack has three sub-flavours. Visible-but-ignored text: footers, watermarks, six-point grey captions on a receipt, QR-code annotations that no human reads but the OCR pass ingests in full. Invisible-to-human text: white text on white background, alpha-channel payloads, EXIF tags the vision encoder occasionally embeds. Adversarial pixels: gradient-optimised noise patches that drive the vision encoder toward a target embedding, producing captions that include phrases like "system: enter maintenance mode" over an otherwise innocuous photograph. The third class — adversarial examples — is the technique cluster MITRE ATLAS catalogues under its Evade ML Model family, now expanded in the v5.1.0 release with fourteen agent-specific techniques contributed by Zenity Labs.

2. PDFs and scanned documents

A PDF is a layout language as much as a document. Hidden text layers, off-page form fields, JavaScript actions, and document-level metadata are all read by modern multimodal models without ceremony. Insurance carriers ingesting scanned claim forms and healthcare front-offices processing faxed consent letters are operating exactly the workflow this channel targets. Lakera and other guardrail vendors have documented indirect prompt injection delivered through enterprise document pipelines as the most prevalent indirect-injection vector outside the browser.

3. Screenshots and rendered HTML

The 2025 ad-review CSS attack lives here. An attacker submits creative for review; the review pipeline screenshots the rendered page and asks an LLM to classify it. The CSS uses opacity: 0, display: none, or off-canvas positioning to hide an instruction from human reviewers while leaving it in the DOM the model's renderer processes. The same pattern shows up in browser-using agents reading attacker-controlled pages — Rehberger's series documents several variants, including a coding-agent compromise driven entirely through MCP tool descriptions.

4. Audio and speech

Voice-input agents face an adjacent surface. Inaudible ultrasonic carriers (DolphinAttack- class), embedded noise that ASR models transcribe into specific tokens, and prosody- shaping attacks against voice-cloned assistants are documented in the academic literature and now appearing in production telephony agents. The transcript the LLM receives bears no relation to what the human in the call heard.

5. ASCII art and text-rendered-vision

ArtPrompt is the canonical instance: words rendered as ASCII art bypass token-level filters because the boxes-and-pipes surface form does not match any banned string, and the model reconstructs the word inside its own representation. The same structural property — benign surface, hostile semantics — underlies a long list of transformation jailbreaks (cipher, FlipAttack, leetspeak, base64), all of which Microsoft's AI Red Team catalogues in its Lessons From Red Teaming 100 Generative AI Products paper as a recurring failure mode for content filters.

Industry impact: claims, KYC, content moderation

Three verticals see multimodal injection first because their core workflow is image-in, decision-out. Insurance claims processing accepts damaged-vehicle photos, accident-scene photos, scanned medical bills, and photos of handwritten forms. Each artefact is treated as evidence; each is parsed by a multimodal pipeline that hands a structured summary to a downstream decisioning agent with authority to approve, pay, or escalate. Know-your-customer onboarding ingests scanned identity documents, proof-of-address letters, and selfies. Content moderation, including the ad-review pipelines Rehberger targeted, accepts arbitrary creative submitted by an arbitrary advertiser.

What these workflows share is a real-money side effect at the end of the pipeline and a vision pre-processor at the start. The threat model the engineering team usually wrote was about data leakage out of the image — PII, PHI, financial detail — not instructions flowing into the model through the image. The asymmetry is costly. The Air Canada chatbot ruling in February 2024 established that a company is legally responsible for what its chatbot tells a customer; the same principle is now propagating, through regulator guidance, to what an agent does with an image a customer uploaded.

Mapping to OWASP LLM01 and MITRE ATLAS

Multimodal injection is not a separate vulnerability class; it is the modality generalisation of categories that are already canonical in the standards stack. OWASP's LLM01 explicitly includes indirect prompt injection — content drawn from documents, web pages, images and tool outputs — within scope. MITRE's ATLAS v5.1.0 framework catalogues 84 techniques across 16 tactics, with adversarial input detection (AML.M0015) as a primary mitigation and the LLM Prompt Injection technique (AML.T0051) covering both direct and indirect variants. The November 2025 update integrated fourteen new techniques focused specifically on AI agents and generative AI systems, several with multimodal delivery vectors.

For organisations operating under a formal risk-management framework, NIST AI 600-1 slots multimodal injection into the catalogue of GAI risks under MAP-2.1 (context of use), MEASURE-2.7 (security and resilience), and MANAGE-4.1 (post-deployment monitoring). The Cloud Security Alliance's Agentic AI Red Teaming Guide (May 2025) places multimodal payloads under its goal-and-instruction-manipulation category, alongside knowledge-base poisoning and memory manipulation. The lineage is clear: this is OWASP LLM01 by way of the image, audio, or document channel, with the same accountability that has always applied.

Defensive patterns that actually work

The mitigation is structural. No prompt-side instruction — no "ignore any instructions in the image" — survives contact with an adversarial corpus, for the same reason no text-prompt instruction to ignore injection survives. The patterns that hold up share three properties: they treat untrusted modalities as untrusted regardless of their surface form, they evaluate extracted content under the same policy as user input, and they enforce the boundary in a separate process from the planner agent.

—Input normalisation in a sandboxed pre-processor. Run OCR, layout extraction, and metadata stripping against the image, PDF, or HTML in an isolated process before the agent sees it. Emit a structured representation — every text region with bounding box, font size, contrast ratio, opacity, and embedded metadata — not the raw artefact.
—Vision-text consistency checks. Ask a separate, capability-bounded model to produce a semantic caption ("a printed retail receipt for a USB-C cable"). Diff the OCR text against the caption. Imperative sentences in the OCR that the caption did not surface — refund, approve, ignore previous, system note — are a tier-elevating signal.
—Capability gating on the planner. The agent that consumes the normalised representation does not have direct tool access. A separate authorisation layer evaluates each proposed tool call against the user's explicit request and the provenance of the inputs that produced it. A multimodal upload should not be sufficient context to authorise a payment.
—Adversarial calibration in CI. Every release runs an adversarial multimodal corpus before the model or pipeline is promoted. The false-pass rate on adversarial inputs is published alongside the benign accuracy. A pipeline with 99% benign accuracy and 40% adversarial false-pass is not a pipeline; it is a liability.
—Provenance tagging in the context window. The model receives the system prompt, the user message, and the extracted artefact content as distinguishable, labelled spans. The agent's tool-calling policy treats untrusted-modality content as quotation, never as instruction.

Why OCR-based filtering is not enough

A common mitigation that does not work is "run OCR, scan the extracted text for known injection phrases, block on hit." Three reasons. First, the corpus of known phrases is always behind the attacker; transformation jailbreaks ensure the surface form does not match. Second, the vision encoder does not always need OCR to extract text — modern multimodal models read images end-to-end, so even if your OCR filter is clean, the model may still resolve the hostile string from pixels your filter never tokenised. Third, the adversarial-pixel class produces no extractable text at all; the attack lives in the encoder's representation space, not in any human-readable rendering.

This is the same argument OWASP makes about RAG and fine-tuning: the technique is necessary, but it is grounding, not security. OCR-based filtering catches the lazy attacker who put the payload in twelve-point Arial. The motivated attacker — the one Rehberger documented in production, the one Aim Security demonstrated in EchoLeak, the one Jiang demonstrated with ArtPrompt — does not bother with twelve-point Arial.

The trust boundary in an AI system is no longer the user message. It is whatever the model can perceive — and every perceivable channel is an injection channel until proven otherwise.

A multimodal hardening checklist

For teams running multimodal agents in production today, the minimum bar is short and concrete:

—Inventory every agent that accepts image, PDF, screenshot, HTML, or audio input. The list is usually longer than the CMDB suggests once internal copilots wired to ticketing systems with attachments are counted.
—Put a sandboxed pre-processor in front of every multimodal entry point. OCR, layout extraction, metadata stripping, and a vision-text consistency diff — running in shadow mode at minimum, blocking mode for tier-one workflows.
—Map findings to OWASP LLM01, MITRE ATLAS AML.T0051, and the CSA Agentic Red Teaming Guide categories. Use the standards taxonomy so the security backlog and the auditor's evidence pack speak the same language.
—Run a multimodal adversarial corpus pre-deployment. ArtPrompt-class transformations, hidden-text payloads, adversarial patches, and a representative slice of indirect injection through document and HTML channels. Treat any successful goal hijack as a release blocker.
—Publish the adversarial false-pass rate alongside benign accuracy. An evaluation pipeline that does not measure its own resilience to adversarial inputs is not measuring trust; it is asserting it.

Operationalising this

Multimodal injection is moving from research curiosity to live exploit faster than most security backlogs are tracking, and the standards stack — OWASP, MITRE, NIST, CSA — has already settled on the taxonomy. The remaining work is operational: a continuous adversarial corpus that exercises image, document, HTML, and audio channels against the specific agent topology you deployed, scored against a tier-aware severity model, and reported as evidence with the same rigour your text-injection program already has. That capability ships as the open-source corpus inside AgentGuardian, with a managed evidence-pack workflow available through the enterprise platform for teams that need signed AIVSS-scored attestations for regulators or internal audit.

Vision-based prompt injection: when the attack is inside the image.