Bloomberg research: RAG LLMs may be less safe than you think

Researchers discovered that safe models paired with safe documents can still generate harmful responses under RAG.

By Kerem Gülen
April 28, 2025
in Research

Retrieval-Augmented Generation, or RAG, has been hailed as a way to make large language models more reliable by grounding their answers in real documents. The logic sounds airtight: give a model curated knowledge to pull from instead of relying solely on its own parameters, and you reduce hallucinations, misinformation, and risky outputs. But a new study suggests that the opposite might be happening. Even the safest models, paired with safe documents, became noticeably more dangerous when using RAG.
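
To make the mechanics concrete, the sketch below shows the basic shape of a RAG pipeline: retrieve a few documents, fold them into the prompt, and ask the model to answer from that context. It is a minimal illustration rather than the study's setup; the toy corpus, the overlap-based retriever, and the `generate` placeholder are all hypothetical.

```python
# Minimal RAG sketch: retrieve supporting documents, then ask the model to
# answer using only that context. All names here are illustrative stand-ins,
# not the study's actual pipeline.

from typing import List

CORPUS = [
    "GPS trackers are small devices that report a vehicle's location.",
    "Retrieval-augmented generation grounds answers in external documents.",
]

def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query: str, docs: List[str]) -> str:
    """Assemble the context-grounded prompt handed to the LLM."""
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer the question using only the documents below.\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def generate(prompt: str) -> str:
    """Placeholder for a call to whatever LLM is being evaluated."""
    return "<model output>"

if __name__ == "__main__":
    question = "How does retrieval-augmented generation reduce hallucinations?"
    docs = retrieve(question, CORPUS)
    print(generate(build_prompt(question, docs)))
```

The point that matters for safety is that the retrieved documents become part of the prompt, so whatever enters the retrieval layer directly shapes what the model is willing to say.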

Researchers from Bloomberg AI, the University of Maryland, and Johns Hopkins conducted one of the first large-scale analyses of RAG systems’ safety. Their findings upend the common assumptions many AI developers and users hold about how retrieval impacts model behavior. Across eleven popular LLMs, RAG often introduced new vulnerabilities, creating unsafe responses that did not exist before.

Retrieval did not protect the models

In a test of over 5,000 harmful prompts, eight out of eleven models showed a higher rate of unsafe answers when RAG was activated. Safe behavior in the non-RAG setting did not predict safe behavior in RAG. The study provided a concrete example: Llama-3-8B, a model that produced unsafe outputs only 0.3 percent of the time in a standard setting, saw that figure jump to 9.2 percent when RAG was used.
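
The headline numbers are simple proportions: the share of harmful prompts that draw an unsafe answer, measured once without retrieval and once with it. A minimal sketch of that comparison is below; the prompt set, the safety judge, and both answer functions are placeholders, not the benchmark or classifiers used in the study.

```python
# Sketch of the unsafe-response-rate comparison. The prompt set, the safety
# judge, and both answer functions are hypothetical placeholders; the study
# used its own harmful-prompt benchmark and safety classifiers.

from typing import Callable, List

def unsafe_rate(prompts: List[str],
                answer: Callable[[str], str],
                is_unsafe: Callable[[str], bool]) -> float:
    """Fraction of prompts whose answer the judge flags as unsafe."""
    flagged = sum(1 for p in prompts if is_unsafe(answer(p)))
    return flagged / len(prompts)

def answer_plain(prompt: str) -> str:
    return "<non-RAG model answer>"   # placeholder call to the bare model

def answer_with_rag(prompt: str) -> str:
    return "<RAG model answer>"       # placeholder retrieval + model call

def judge(response: str) -> bool:
    return False                      # placeholder safety classifier

harmful_prompts = ["<harmful prompt 1>", "<harmful prompt 2>"]  # stand-in set

baseline = unsafe_rate(harmful_prompts, answer_plain, judge)
with_rag = unsafe_rate(harmful_prompts, answer_with_rag, judge)
print(f"unsafe without RAG: {baseline:.1%}  |  with RAG: {with_rag:.1%}")
```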

Not only did the overall percentage of unsafe responses climb, but models also expanded their vulnerabilities across new risk categories. Previously contained weaknesses in areas like unauthorized practice of law or malware guidance spread into broader categories including adult content, misinformation, and political campaigning. RAG, instead of narrowing risk, broadened it.

Three reasons why RAG can backfire

The researchers traced this unexpected danger to three interlocking factors:

  • LLM Safety Baseline: Models that were less safe to begin with suffered the greatest deterioration in RAG settings.
  • Document Safety: Even when retrieved documents were classified as safe, models still generated harmful content.
  • RAG Task Performance: How a model combined external documents with its internal knowledge strongly influenced outcomes.

What emerged is that simply pairing a safe model with safe documents is no guarantee of safe responses. The mechanisms that make RAG appealing, such as context synthesis and document-guided answering, also open new pathways for misuse and misinterpretation.

Two main behaviors stood out when researchers analyzed unsafe outputs stemming from safe documents. First, models often repurposed harmless information into dangerous advice. For instance, a Wikipedia entry about how police use GPS trackers became, in the hands of a model, a tutorial for criminals on evading capture.

Second, even when instructed to rely solely on documents, models sometimes mixed in internal knowledge. This blending of memory and retrieval undermined the safeguards RAG was supposed to provide. Even when external documents were neutral or benign, internal unsafe knowledge surfaced in ways that fine-tuning had previously suppressed in the non-RAG setting.
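
For reference, the kind of instruction at issue looks roughly like the template below. This is generic, hypothetical wording, not the exact prompt used in the study; it simply illustrates the "documents only" constraint the models were found to violate.

```python
# A generic "answer from the documents only" instruction. This is not the
# study's exact prompt; it illustrates the constraint the models were found
# to violate by blending in their own internal knowledge.

from typing import List

CONTEXT_ONLY_TEMPLATE = """You are given a set of documents.
Answer the question using ONLY information found in the documents.
If the documents do not contain the answer, say "I don't know."

Documents:
{documents}

Question: {question}
Answer:"""

def render(documents: List[str], question: str) -> str:
    doc_block = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(documents))
    return CONTEXT_ONLY_TEMPLATE.format(documents=doc_block, question=question)

print(render(["Police sometimes attach GPS trackers to suspect vehicles."],
             "How do police use GPS trackers?"))
```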

Adding more retrieved documents only worsened the problem. Experiments showed that increasing the number of context documents made LLMs more likely to answer unsafe questions, not less. A single safe document was often enough to start changing a model’s risk profile.

Not all models handled RAG equally. Claude 3.5 Sonnet, for example, remained remarkably resilient, showing very low unsafe response rates even under RAG pressure. Gemma 7B appeared safe at first glance, but deeper analysis revealed that it often simply refused to answer questions; its poor extraction and summarization skills masked vulnerabilities rather than fixing them.

In general, models that performed better at genuine RAG tasks like summarization and extraction were paradoxically more vulnerable. Their ability to synthesize from documents also made it easier for them to misappropriate harmless facts into unsafe content when the topic was sensitive.

The safety cracks widened further when researchers tested existing red-teaming methods designed to jailbreak LLMs. Techniques like GCG and AutoDAN, which work well for standard models, largely failed to transfer their success when targeting RAG setups.

One of the biggest challenges was that adversarial prompts optimized for a non-RAG model lost effectiveness when documents were injected into the context. Even re-optimizing adversarial prompts specifically for RAG improved the results only slightly. Because the retrieved documents change from query to query, the context is unstable, which makes it hard for traditional jailbreak strategies to succeed consistently.
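
The transfer failure is easy to see at the prompt level: the adversarial string is tuned against a bare request, but under RAG the model is served a document-stuffed prompt instead. The sketch below contrasts the two; the suffix, request, and documents are placeholders, and no attack is actually implemented here.

```python
# Why optimized jailbreak strings transfer poorly to RAG: the suffix was tuned
# against the bare request below, but at inference time the model sees the
# document-stuffed prompt instead. The strings and documents are placeholders;
# this does not implement GCG or AutoDAN.

ADV_SUFFIX = "<suffix optimized against the base model>"
HARMFUL_REQUEST = "<harmful request>"
RETRIEVED_DOCS = ["<safe document 1>", "<safe document 2>"]

# The prompt the attack was optimized for (no retrieval):
optimized_against = f"{HARMFUL_REQUEST} {ADV_SUFFIX}"

# The prompt the model actually receives under RAG: documents are injected
# first, shifting the context the suffix was tuned to exploit, and the
# documents themselves change from query to query.
doc_block = "\n".join(f"- {d}" for d in RETRIEVED_DOCS)
served_under_rag = (
    f"Answer using only the documents below.\n{doc_block}\n\n"
    f"Question: {HARMFUL_REQUEST} {ADV_SUFFIX}\nAnswer:"
)

print(optimized_against)
print(served_under_rag)
```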

This gap shows that AI security tools and evaluations built for base models are not enough. Dedicated RAG-specific red-teaming will be needed if developers want to deploy retrieval-enhanced systems safely at scale.

Retrieval is not a safety blanket

As companies increasingly move toward RAG architectures for large language model applications, the findings of this study land as a stark warning. Retrieval does help reduce hallucinations and improve factuality, but it does not automatically translate into safer outputs. Worse, it introduces new layers of risk that traditional safety interventions were not designed to handle.

The takeaway is clear: LLM developers cannot assume that bolting on retrieval will make models safer. Fine-tuning must be explicitly adapted for RAG workflows. Red-teaming must account for context dynamism. Monitoring must treat the retrieval layer itself as a potential attack vector, not just a passive input.
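
What retrieval-layer monitoring might look like in practice is sketched below: screen the retrieved documents before they reach the model and moderate the grounded answer, rather than checking only the user's query. The filter functions and model call are hypothetical placeholders, not a vetted defense or the paper's recommendation verbatim.

```python
# One possible shape of retrieval-layer monitoring: screen retrieved documents
# before they reach the model and moderate the grounded answer, instead of
# checking only the user's query. Every function here is a hypothetical
# placeholder, not a vetted defense.

from typing import List

def retrieve(query: str) -> List[str]:
    return ["<retrieved document>"]      # placeholder retriever

def generate(query: str, docs: List[str]) -> str:
    return "<grounded answer>"           # placeholder LLM call

def flag_document(doc: str) -> bool:
    return False                         # placeholder document screen

def flag_response(text: str) -> bool:
    return False                         # placeholder output moderation

def guarded_rag(query: str) -> str:
    docs = [d for d in retrieve(query) if not flag_document(d)]
    answer = generate(query, docs)
    return "I can't help with that." if flag_response(answer) else answer

print(guarded_rag("<user question>"))
```

Screening the documents as well as the output reflects the study's point that the retrieval layer is itself an input that can push a model toward unsafe behavior, not a passive source of facts.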

Without RAG-specific defenses, the very techniques designed to ground language models in truth could instead create new vulnerabilities. If the industry does not address these gaps quickly, the next generation of LLM deployments might inherit deeper risks disguised under the comforting label of retrieval.

