Sleep-time compute: Meet the LLM that thinks while you sleep

New research reveals a powerful optimization for LLMs: "sleep-time compute." Instead of relying solely on complex real-time reasoning, models can pre-process context offline. Read how this technique achieves up to 5x lower token usage for the same accuracy and rewrites the cost-performance curve for AI.

By Kerem Gülen
April 18, 2025
in Research

You tap “Run” on a GPT‑powered assistant and then watch the spinner. Seconds stretch into minutes, token meters climb, and the OpenAI invoice creeps higher. Latency and cost have become the invisible tax on the large language model boom, especially when a single tough query can trigger thousands of fresh inference tokens. A new research proposal called sleep‑time compute argues that those tokens are often spent in the wrong phase of the workflow. Instead of cramming all reasoning into the moment the user hits Enter, why not let the model “think” during its idle hours, transform raw context into reusable insight, and slash the bill when the real question finally arrives?

The idea feels familiar to anyone who ever scheduled a database index or compiled code before shipping: preprocess while nobody is looking, respond instantly when they are. Yet applying that mindset to language models requires fresh benchmarks, careful accounting, and proof that offline effort transfers to online accuracy. Kevin Lin and colleagues from Letta and UC Berkeley supply exactly that evidence in “Sleep‑time Compute: Beyond Inference Scaling at Test‑time,” and their numbers suggest a rethink of how enterprise AI products budget GPU cycles.

Traditional test‑time scaling tells an LLM to work harder when the question is hard: sample multiple chains of thought, extend the reasoning trace, rerank responses, or fork dozens of candidate answers in parallel. Those tricks boost accuracy for math, coding, and knowledge tasks, but they also inflate latency and wallet drain. Users wait; vendors pay. Worse, the paradigm assumes each query is a stateless one‑off that arrives with its full context in the same request.

In the real world, contexts persist. Customer‑support bots reread the same knowledge base, coding agents navigate the same repository, and research copilots revisit a shared document corpus. The authors argue that in these stateful settings, enormous chunks of reasoning are performed redundantly. Sleep‑time compute exploits that redundancy by letting the model pre‑parse the context during idle windows, create a distilled, inference‑ready representation, and store it for later reuse. When the user finally asks, the LLM answers in a fraction of the tokens because much of the heavy lifting is already baked into the prompt.

Why sleep‑time compute rewrites the cost curve

The researchers formalize the workflow in two phases. During sleep‑time the model sees only the context c, predicts likely angles of interest, and produces a rewritten context c′ that contains intermediate deductions, structured summaries, or cached chain‑of‑thought snippets. During test‑time the user’s query q arrives. The model now receives c′ instead of the raw context and can reach the correct answer with a far smaller compute budget b. Because idle hours are cheap and parallelizable, the organization pays low‑priority rates for the preprocessing and preserves premium inference capacity for user‑facing responsiveness.
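
As a rough sketch of how those two phases fit together, the Python below outlines the idea. The `llm` callable, the prompt wording, and the function names are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable

# `llm` stands in for any text-completion function (prompt in, text out).
# It is a placeholder, not a specific vendor API.
LLM = Callable[[str], str]


def sleep_time_compute(llm: LLM, context: str) -> str:
    """Offline phase: turn the raw context c into an enriched context c'.

    While no user is waiting, the model anticipates likely questions and
    records intermediate deductions that can be reused later.
    """
    prompt = (
        "Study the following context. Record the key facts, intermediate "
        "calculations, and likely follow-up questions with partial answers.\n\n"
        "Context:\n" + context
    )
    notes = llm(prompt)
    # c' = the original context plus the cached reasoning.
    return context + "\n\nPre-computed notes:\n" + notes


def answer_at_test_time(llm: LLM, c_prime: str, query: str) -> str:
    """Online phase: answer q against c' with a small token budget,
    since much of the heavy lifting is already baked into the prompt."""
    return llm(c_prime + "\n\nQuestion: " + query + "\nAnswer concisely.")
```

In practice the offline call would run on cheap, low-priority capacity and its output would be cached per context, so many later queries can share a single sleep-time pass.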

To quantify the benefit, the team split two classic math‑reasoning suites—GSM‑Symbolic and AIME—into Stateful variants where every problem is decomposed into a context paragraph and a separate question. They also built Multi‑Query GSM‑Symbolic, in which each context spawns several related questions, mimicking a user who keeps poking at the same document. The evaluation matrix compared baseline GPT‑4o, GPT‑4o‑mini, o1, o3‑mini, Claude Sonnet, and DeepSeek‑R1 under three conditions: standard test‑time scaling, sleep‑time compute with different offline budgets, and pass@k parallel sampling.
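
Concretely, a stateful item of that kind can be pictured as one shared context plus the questions that reuse it. The record below is an illustrative stand-in, not the datasets' actual schema, and the arithmetic example is invented.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulItem:
    """One Multi-Query-style example: a context paragraph shared by
    several related questions (field names are illustrative)."""
    context: str
    questions: list[str] = field(default_factory=list)


# Invented example in the spirit of Multi-Query GSM-Symbolic:
item = StatefulItem(
    context=("A bakery sells 40 loaves on Monday and twice as many on "
             "Tuesday. Each loaf costs 3 dollars."),
    questions=[
        "How many loaves were sold on Tuesday?",
        "What was the total revenue across both days?",
    ],
)
```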

What the experiments show

Across every model except the smallest o1, the sleep‑time strategy pushed the accuracy‑per‑token frontier outward. On Stateful GSM‑Symbolic and Stateful AIME the authors report:

  • 5× lower test‑time tokens to hit the same accuracy as the baseline sequential chain‑of‑thought runs.
  • 13 percent accuracy gain on GSM when the offline budget scaled up to five parallel sleep‑time generations.
  • 18 percent accuracy gain on AIME with higher‑effort offline reasoning traces.
  • 2.5× reduction in average cost per query when ten related questions shared the same preprocessed context.

Perhaps more striking, sleep‑time compute beat the canonical pass@k trick at equal test‑time budgets. Pass@k assumes an oracle verifier can instantly pick the best of k sampled answers, an unrealistic crutch in production. Sleep‑time compute reaches higher accuracy without that luxury because the heavy reasoning already lives in c′.
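
For reference, pass@k with an oracle verifier amounts to a few lines of accounting. The sketch below uses a fake sampler standing in for a model and is only meant to show what the oracle assumption buys.

```python
import random
from typing import Callable


def pass_at_k(sample: Callable[[], str],
              oracle_is_correct: Callable[[str], bool],
              k: int) -> bool:
    """Draw k independent samples and count the attempt as solved if any
    sample is correct. The oracle verifier is the unrealistic part: few
    production systems can instantly tell which of k answers is right."""
    return any(oracle_is_correct(sample()) for _ in range(k))


# Toy usage: a "model" that happens to answer correctly 30% of the time.
random.seed(0)
wins = sum(
    pass_at_k(lambda: "42" if random.random() < 0.3 else "wrong",
              lambda ans: ans == "42", k=5)
    for _ in range(1000)
)
print(f"pass@5 on the toy task: {wins / 1000:.2%}")
```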

The payoff is sensitive to how predictable the eventual question is. When the researchers binned GSM items by the log probability that Llama‑2 assigned to the question given the context, the accuracy delta between sleep‑time and baseline widened for the most predictable quintile. In plain English: the more obvious the follow‑up question, the bigger the win from preparing your homework in advance.
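
That binning could be reproduced with any model that exposes token log-probabilities: score log P(question | context) and cut the items into quantiles. The scoring function below is a hypothetical stand-in, not the paper's exact setup.

```python
from typing import Callable

# Hypothetical scorer: score_logprob(prefix, text) returns the total log
# probability of `text` conditioned on `prefix`, e.g. summed token
# log-probs from a local Llama checkpoint. Not a real library call.
ScoreFn = Callable[[str, str], float]


def bin_by_predictability(items: list[tuple[str, str]],
                          score_logprob: ScoreFn,
                          n_bins: int = 5) -> list[list[tuple[str, str]]]:
    """Sort (context, question) pairs by log P(question | context) and
    cut them into roughly equal bins, least predictable first."""
    ranked = sorted(items, key=lambda cq: score_logprob(cq[0], cq[1]))
    size = max(1, len(ranked) // n_bins)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```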

Numbers are one thing; product implications are another. The authors run a real repository test called SWE‑Features in which an agent must modify three or more files to implement a feature. With only low test‑time budgets, sleep‑time compute cut token use by about 50 percent while matching F1, meaning faster merges and lower GPU bills on continuous‑integration bots. At very high budgets, classic test‑time reasoning regained a slight edge in precision, suggesting a hybrid policy: allocate offline compute aggressively when latency matters or when contexts will be reused, fall back to rich online chains only for one‑off or highly unpredictable queries.
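
Such a hybrid policy can be written as a small routing rule. The thresholds below are invented for illustration and are not values from the paper.

```python
def choose_phase(expected_reuses: int,
                 latency_budget_s: float,
                 question_predictability: float) -> str:
    """Toy routing rule between the two phases (thresholds are made up):
    invest offline when a context will be reused, the user is
    latency-sensitive, or follow-up questions are easy to anticipate;
    otherwise pay for a fresh online reasoning chain."""
    if expected_reuses >= 2 or latency_budget_s < 5.0:
        return "sleep-time"
    if question_predictability > 0.5:
        return "sleep-time"
    return "test-time"


print(choose_phase(expected_reuses=10, latency_budget_s=2.0,
                   question_predictability=0.8))  # -> sleep-time
```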

The framework also opens doors for synthetic data generation. If sleep‑time reasoning produces rich natural‑language representations of a codebase or document, those artifacts themselves become training data for future fine‑tuning—a virtuous loop where offline thinking seeds the next generation of model improvements without scraping more internet text.

Operationally, the technique invites engineering questions. How often should the context cache refresh? How large can c′ grow before it cancels the token savings? Which idle cycles are really free in a shared cluster? Yet none of these hurdles look as formidable as the current reality of paying real‑time prices for redundant reasoning. Enterprises that already schedule nightly builds, search‑index crawls, or materialized views have mental models for this optimization.
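
Those questions map onto familiar caching decisions. Here is a minimal sketch assuming a time-to-live refresh and a hard cap on how large c′ may grow before it cancels the savings; both limits are invented defaults.

```python
import time
from typing import Optional


class SleepTimeCache:
    """Minimal cache for pre-processed contexts: entries go stale after a
    TTL, and a c' that grows past `max_chars` is rejected so it cannot
    erase the token savings. Both limits are illustrative defaults."""

    def __init__(self, ttl_s: float = 3600.0, max_chars: int = 20_000):
        self.ttl_s = ttl_s
        self.max_chars = max_chars
        self._store: dict[str, tuple[float, str]] = {}

    def put(self, context_id: str, c_prime: str) -> bool:
        if len(c_prime) > self.max_chars:
            return False  # too large: fall back to the raw context
        self._store[context_id] = (time.time(), c_prime)
        return True

    def get(self, context_id: str) -> Optional[str]:
        entry = self._store.get(context_id)
        if entry is None:
            return None
        stamp, c_prime = entry
        if time.time() - stamp > self.ttl_s:
            del self._store[context_id]  # stale: trigger a refresh pass
            return None
        return c_prime
```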


Where offline thinking fits next

Sleep‑time compute is not a silver bullet. Queries that blindside the system or contexts that mutate too rapidly will still demand fresh chains of thought. The paper itself flags open research into adaptive policies that predict when offline investment will pay off, perhaps by estimating context entropy or user intent distribution. Even so, the core takeaway stands: large language models do not need to think only when the user is watching. By borrowing an age‑old computing trick—do tomorrow’s work tonight—developers can cut latency, shrink bills, and still climb the accuracy ladder.

The upshot: Your next LLM feature might not require a bigger model or a deeper reasoning budget. It might simply require letting the model sleep on the problem first.

