Bloomberg research: RAG LLMs may be less safe than you think

Researchers discovered that safe models paired with safe documents can still generate harmful responses under RAG.

By Kerem Gülen
April 28, 2025
in Research

Retrieval-Augmented Generation, or RAG, has been hailed as a way to make large language models more reliable by grounding their answers in real documents. The logic sounds airtight: give a model curated knowledge to pull from instead of relying solely on its own parameters, and you reduce hallucinations, misinformation, and risky outputs. But a new study suggests that the opposite might be happening. Even the safest models, paired with safe documents, became noticeably more dangerous when using RAG.
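
To make the mechanics concrete, the sketch below shows the basic shape of a RAG pipeline: retrieve a few documents, fold them into the prompt, and ask the model to answer from that context. It is a minimal illustration rather than the study's setup; the toy corpus, the overlap-based retriever, and the `generate` placeholder are all hypothetical.

```python
# Minimal RAG sketch: retrieve supporting documents, then ask the model to
# answer using only that context. All names here are illustrative stand-ins,
# not the study's actual pipeline.

from typing import List

CORPUS = [
    "GPS trackers are small devices that report a vehicle's location.",
    "Retrieval-augmented generation grounds answers in external documents.",
]

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query: str, docs: List[str]) -> str:
    """Assemble the context-grounded prompt handed to the LLM."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer the question using only the documents below.\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Placeholder for a call to whatever LLM is being evaluated."""
    return "<model output>"

if __name__ == "__main__":
    question = "How does retrieval-augmented generation reduce hallucinations?"
    docs = retrieve(question, CORPUS)
    print(generate(build_prompt(question, docs)))
```

The point that matters for safety is that the retrieved documents become part of the prompt, so whatever enters the retrieval layer directly shapes what the model is willing to say.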

Researchers from Bloomberg AI, the University of Maryland, and Johns Hopkins conducted one of the first large-scale analyses of RAG systems’ safety. Their findings upend the common assumptions many AI developers and users hold about how retrieval impacts model behavior. Across eleven popular LLMs, RAG often introduced new vulnerabilities, creating unsafe responses that did not exist before.

Retrieval did not protect the models

In a test of over 5,000 harmful prompts, eight out of eleven models showed a higher rate of unsafe answers when RAG was activated. Safe behavior in the non-RAG setting did not predict safe behavior in RAG. The study provided a concrete example: Llama-3-8B, a model that produced unsafe outputs only 0.3 percent of the time in a standard setting, saw that figure jump to 9.2 percent when RAG was used.
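
The headline numbers are simple proportions: the share of harmful prompts that draw an unsafe answer, measured once without retrieval and once with it. A minimal sketch of that comparison is below; the prompt set, the safety judge, and both answer functions are placeholders, not the benchmark or classifiers used in the study.

```python
# Sketch of the unsafe-response-rate comparison. The prompt set, the safety
# judge, and both answer functions are hypothetical placeholders; the study
# used its own harmful-prompt benchmark and safety classifiers.

from typing import Callable, List

def unsafe_rate(prompts: List[str],
                answer: Callable[[str], str],
                is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of prompts whose answer the judge flags as unsafe."""
    flagged = sum(1 for p in prompts if is_unsafe(answer(p)))
    return flagged / len(prompts)

def answer_plain(prompt: str) -> str:
    return "<non-RAG model answer>"   # placeholder call to the bare model

def answer_with_rag(prompt: str) -> str:
    return "<RAG model answer>"       # placeholder retrieval + model call

def judge(response: str) -> bool:
    return False                      # placeholder safety classifier

harmful_prompts = ["<harmful prompt 1>", "<harmful prompt 2>"]  # stand-in set

baseline = unsafe_rate(harmful_prompts, answer_plain, judge)
with_rag = unsafe_rate(harmful_prompts, answer_with_rag, judge)
print(f"unsafe without RAG: {baseline:.1%}  |  with RAG: {with_rag:.1%}")
```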

Not only did the overall percentage of unsafe responses climb, but models also expanded their vulnerabilities across new risk categories. Previously contained weaknesses in areas like unauthorized practice of law or malware guidance spread into broader categories including adult content, misinformation, and political campaigning. RAG, instead of narrowing risk, broadened it.

Three reasons why RAG can backfire

The researchers traced this unexpected danger to three interlocking factors:

  • LLM Safety Baseline: Models that were less safe to begin with suffered the greatest deterioration in RAG settings.
  • Document Safety: Even when retrieved documents were classified as safe, models still generated harmful content.
  • RAG Task Performance: How a model combined external documents with its internal knowledge strongly influenced outcomes.

What emerged is that simply pairing a safe model with safe documents is no guarantee of safe responses. The mechanisms that make RAG appealing, such as context synthesis and document-guided answering, also open new pathways for misuse and misinterpretation.

Two main behaviors stood out when researchers analyzed unsafe outputs stemming from safe documents. First, models often repurposed harmless information into dangerous advice. For instance, a Wikipedia entry about how police use GPS trackers became, in the hands of a model, a tutorial for criminals on evading capture.

Second, even when instructed to rely solely on documents, models sometimes mixed in internal knowledge. This blending of memory and retrieval undermined the safeguards RAG was supposed to provide. Even when external documents were neutral or benign, internal unsafe knowledge surfaced in ways that fine-tuning had previously suppressed in the non-RAG setting.
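
For reference, the kind of instruction at issue looks roughly like the template below. This is generic, hypothetical wording, not the exact prompt used in the study; it simply illustrates the "documents only" constraint the models were found to violate.

```python
# A generic "answer from the documents only" instruction. This is not the
# study's exact prompt; it illustrates the constraint the models were found
# to violate by blending in their own internal knowledge.

from typing import List

CONTEXT_ONLY_TEMPLATE = """You are given a set of documents.
Answer the question using ONLY information found in the documents.
If the documents do not contain the answer, say "I don't know."

Documents:
{documents}

Question: {question}
Answer:"""

def render(documents: List[str], question: str) -> str:
    doc_block = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return CONTEXT_ONLY_TEMPLATE.format(documents=doc_block, question=question)

print(render(["Police sometimes attach GPS trackers to suspect vehicles."],
             "How do police use GPS trackers?"))
```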

Adding more retrieved documents only worsened the problem. Experiments showed that increasing the number of context documents made LLMs more likely to answer unsafe questions, not less. A single safe document was often enough to start changing a model’s risk profile.

Not all models handled RAG equally. Claude 3.5 Sonnet, for example, remained remarkably resilient, showing very low unsafe response rates even under RAG pressure. Gemma 7B appeared safe at first glance, but deeper analysis revealed that it often simply refused to answer questions; its poor extraction and summarization skills masked vulnerabilities rather than fixing them.

In general, models that performed better at genuine RAG tasks like summarization and extraction were paradoxically more vulnerable. Their ability to synthesize from documents also made it easier for them to misappropriate harmless facts into unsafe content when the topic was sensitive.

The safety cracks widened further when researchers tested existing red-teaming methods designed to jailbreak LLMs. Techniques like GCG and AutoDAN, which work well for standard models, largely failed to transfer their success when targeting RAG setups.

One of the biggest challenges was that adversarial prompts optimized for a non-RAG model lost effectiveness when documents were injected into the context. Even re-optimizing adversarial prompts specifically for RAG improved the results only slightly. Because the retrieved documents change from query to query, the context is unstable, which makes it hard for traditional jailbreak strategies to succeed consistently.
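
The transfer failure is easy to see at the prompt level: the adversarial string is tuned against a bare request, but under RAG the model is served a document-stuffed prompt instead. The sketch below contrasts the two; the suffix, request, and documents are placeholders, and no attack is actually implemented here.

```python
# Why optimized jailbreak strings transfer poorly to RAG: the suffix was tuned
# against the bare request below, but at inference time the model sees the
# document-stuffed prompt instead. The strings and documents are placeholders;
# this does not implement GCG or AutoDAN.

ADV_SUFFIX = "<suffix optimized against the base model>"
HARMFUL_REQUEST = "<harmful request>"
RETRIEVED_DOCS = ["<safe document 1>", "<safe document 2>"]

# The prompt the attack was optimized for (no retrieval):
optimized_against = f"{HARMFUL_REQUEST} {ADV_SUFFIX}"

# The prompt the model actually receives under RAG: documents are injected
# first, shifting the context the suffix was tuned to exploit, and the
# documents themselves change from query to query.
doc_block = "\n".join(f"- {d}" for d in RETRIEVED_DOCS)
served_under_rag = (
    f"Answer using only the documents below.\n{doc_block}\n\n"
    f"Question: {HARMFUL_REQUEST} {ADV_SUFFIX}\nAnswer:"
)

print(optimized_against)
print(served_under_rag)
```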

This gap shows that AI security tools and evaluations built for base models are not enough. Dedicated RAG-specific red-teaming will be needed if developers want to deploy retrieval-enhanced systems safely at scale.

Retrieval is not a safety blanket

As companies increasingly move toward RAG architectures for large language model applications, the findings of this study land as a stark warning. Retrieval does help reduce hallucinations and improve factuality, but it does not automatically translate into safer outputs. Worse, it introduces new layers of risk that traditional safety interventions were not designed to handle.

The takeaway is clear: LLM developers cannot assume that bolting on retrieval will make models safer. Fine-tuning must be explicitly adapted for RAG workflows. Red-teaming must account for context dynamism. Monitoring must treat the retrieval layer itself as a potential attack vector, not just a passive input.
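
What retrieval-layer monitoring might look like in practice is sketched below: screen the retrieved documents before they reach the model and moderate the grounded answer, rather than checking only the user's query. The filter functions and model call are hypothetical placeholders, not a vetted defense or the paper's recommendation verbatim.

```python
# One possible shape of retrieval-layer monitoring: screen retrieved documents
# before they reach the model and moderate the grounded answer, instead of
# checking only the user's query. Every function here is a hypothetical
# placeholder, not a vetted defense.

from typing import List

def retrieve(query: str) -> List[str]:
    return ["<retrieved document>"]      # placeholder retriever

def generate(query: str, docs: List[str]) -> str:
    return "<grounded answer>"           # placeholder LLM call

def flag_document(doc: str) -> bool:
    return False                         # placeholder document screen

def flag_response(text: str) -> bool:
    return False                         # placeholder output moderation

def guarded_rag(query: str) -> str:
    docs = [d for d in retrieve(query) if not flag_document(d)]
    answer = generate(query, docs)
    return "I can't help with that." if flag_response(answer) else answer

print(guarded_rag("<user question>"))
```

Screening the documents as well as the output reflects the study's point that the retrieval layer is itself an input that can push a model toward unsafe behavior, not a passive source of facts.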

Without RAG-specific defenses, the very techniques designed to ground language models in truth could instead create new vulnerabilities. If the industry does not address these gaps quickly, the next generation of LLM deployments might inherit deeper risks disguised under the comforting label of retrieval.

