From Vulnerability to Strategy: Defending LLMs Against Prompt Injection Attacks

Large language models (LLMs) have rapidly transformed how we work, but they’ve also introduced a fragile trust boundary that attackers are already exploiting.

One key risk is prompt injection, a fundamental vulnerability where LLMs don’t always know which instructions to trust. Attackers can exploit this weakness by crafting malicious prompts that override the model’s intended behavior.

In 2025, prompt-based attacks emerged as a leading AI exploit vector, accounting for about 35% of real-world security incidents with prompt injection ranking as the #1 risk in the OWASP Top 10 for LLM Applications.

How do Prompt Injection Attacks Work?

Prompt injection attacks take advantage of how LLMs are built to work. These systems are designed to be helpful and follow instructions expressed in natural language, but they often struggle to distinguish between safe and unsafe instructions. As a result, they can mistake malicious commands hidden within user input or retrieved content for legitimate directions. This makes it possible for attackers to steer the model in unintended ways, manipulating how it behaves, what it reveals, and how it interacts with other systems.

The two main delivery methods for a prompt injection threat are:

Direct prompt injection: This is one of the simplest ways to manipulate an LLM. This occurs when an attacker puts the malicious instructions directly into the prompt they submit to the model. One example is the classic “ignore previous instructions” pattern, which aims to override system instructions.
Indirect prompt injection: This is a more subtle and dangerous variant of prompt injection, where malicious instructions are embedded within external content that an LLM processes rather than being provided directly in the user prompt. The user may never see the injected instructions, but the model can still treat those instructions as legitimate. These inputs can include webpages, documents, emails, or resumes that appear benign on the surface but contain hidden instructions designed to manipulate the model’s behavior. According to the Microsoft Digital Defense Report 2025, “Indirect prompt injection attacks are particularly concerning for developers and organizations that rely on LLMs to process untrusted or user-generated content.”

Real-World Prompt Injection Attacks

Prompt injection attacks are happening every day, with real consequences for businesses and users. One of the earliest examples involved Microsoft’s Bing Chat, internally codenamed “Sydney,” an AI chat system that combined conversational capabilities with live search. Sydney relied on hidden system prompts and safety rules to safeguard its behavior. However, attackers discovered that carefully crafted prompts, such as instructions to “ignore previous instructions,” could bypass these safeguards, causing the model to reveal internal system guidelines that were never meant to be exposed.

In enterprise environments, attackers embedded malicious instructions in documents processed by RAG (retrieval-augmented generation) systems, causing models to leak sensitive business data, disable safety filters, or trigger unauthorized API calls. “Zero-click” prompt injections where AI assistants automatically process malicious content, quietly exfiltrating data from tools like Microsoft 365 Copilot also are becoming more common.

The risks are serious. Prompt injections can lead to data leaks, policy violations, regulatory breaches, and operational disruptions, especially when LLMs are integrated into core workflows. High-profile 2025 incidents — including poisoned RAG systems, zero-click attacks on AI copilots, exploits against coding assistants like GitHub Copilot Chat, and the CometJacking attack on Perplexity’s Comet AI browser — turned prompt injection from a theoretical risk into a major concern.

Why Traditional Cybersecurity Defenses Fail Against Prompt Injection

Prompt injection attacks are tricky because they target how LLMs process instructions, not software bugs or network vulnerabilities. Traditional defenses like firewalls and antivirus software are designed to block malicious code or files, not instructions hidden in plain text. Prompt injections manipulate the model’s reasoning directly, and AI can’t innately distinguish between legitimate and malicious instructions embedded in user input or external content.

Even content filters and keyword detection often fail. Attackers can disguise commands using formatting tricks or multi-turn prompts. Models that interact with tools, APIs, or databases can be tricked into performing actions because the instructions appear to come from “trusted” model behavior rather than an external exploit. Prompt injection attacks hit the cognitive layer of AI, not the technical layer conventional security protects.

Static defenses like hardened system prompts, hard-coded refusals, or post-processing filters offer only partial protection. Obfuscated language, encoding tricks, and clever multi-turn prompts can bypass them. As prompts become more complex to counter attacks, they can degrade performance and increase costs. Effectively safeguarding against a prompt injection attack requires an agile, adaptive security strategy, which is where purple teaming comes in.

How Purple Teaming Can Safeguard Against Prompt Injection Attacks

Purple teaming is a powerful approach to test and strengthen defenses against prompt injection attacks, closing the loop between offense, defense, and ongoing improvement. By combining red and blue team efforts, organizations can identify weaknesses, implement fixes, and verify that defenses actually work in real-world scenarios.

Red Team: Offensive prompt injection testing

Red teams simulate realistic attacks to uncover vulnerabilities in AI systems. They craft adversarial prompts designed to override system instructions, exfiltrate data, exploit connected tools or APIs, and bypass safety guardrails. These tests leverage real content channels, so the prompts reflect the kinds of inputs production AI assistants are likely to encounter, not just laboratory examples.

Essentially, red teams perform offensive security maneuvers, “ethically hacking” an organization using advanced capabilities, to identify gaps before malicious actors can exploit them. By recreating real-world attack scenarios, they provide actionable insights that help organizations strengthen defenses and reduce exposure to prompt injection attacks.

Blue Team: Defensive detection and validation

Blue teams focus on protecting and monitoring AI systems. They set up logging, anomaly detection, and monitoring around the LLM to track unusual behavior and suspicious activity. During red team exercises, blue teams validate whether defenses such as input filters, output classifiers, access controls, and alerting mechanisms trigger correctly and with sufficient accuracy.

Their goal is to catch vulnerabilities before they can be exploited in production. By defending against red team maneuvers and analyzing attack traces, blue teams continuously refine policies, improve detection rules, and strengthen overall system resilience. Essentially, they function as the guardians of AI workflows, ensuring that malicious instructions or unsafe outputs are identified and contained.

Purple Team: Collaborative defense and continuous improvement

A purple team brings red and blue teams together to work side by side, creating a continuous feedback loop between offense and defense. Every successful prompt injection discovered by the red team becomes a backlog item for new guardrails, policies, or tool permission changes, which are then retested until reliably blocked.

This approach ensures that defenses evolve alongside attacker techniques rather than lagging months behind. Purple teaming encourages a shared intelligence mentality: red teams identify where and how attacks could succeed, while blue teams provide insights on how to detect and block them effectively.

Developing a purple team mindset is a key goal in safeguarding against prompt injection attacks. By continuously integrating offensive findings and defensive responses, organizations can ensure that even advanced, subtle, or common attacks are contained before they cause harm.

Prompt injection is not a passing flaw. It exposes a deeper challenge in how LLMs interpret and prioritize instructions. Addressing this risk demands continuous validation, adversarial testing, and a security mindset that evolves alongside the technology itself. Organizations that treat prompt injection as an ongoing discipline, not a one-time fix, will be better positioned to deploy AI systems safely and at scale.

Off

Interested in Purple Teaming?

Learn More

From Vulnerability to Strategy: Defending LLMs Against Prompt Injection Attacks

How do Prompt Injection Attacks Work?

Real-World Prompt Injection Attacks

Why Traditional Cybersecurity Defenses Fail Against Prompt Injection

How Purple Teaming Can Safeguard Against Prompt Injection Attacks

Red Team: Offensive prompt injection testing

Blue Team: Defensive detection and validation

Purple Team: Collaborative defense and continuous improvement

Interested in Purple Teaming?

Heather Wiederhoeft

Privacy Policy

Cookie Policy

Terms of Service

Accessibility

Fortra AI Use

Impressum