Dataconomy

Apple researchers develop token-efficient AI for long-form video understanding

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously.

by Emre Çıtak
August 22, 2025
in Research

Researchers from Apple have introduced a new family of video large language models that are both efficient and powerful, particularly at smaller, mobile-friendly scales. The research, detailed in a paper titled “SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding,” presents an architecture that achieves state-of-the-art performance by balancing how many frames it processes against how many tokens it spends on each frame.

The ‘SlowFast’ approach to video analysis

A primary challenge for existing video AI models is managing the immense computational cost of processing long video sequences. These models face a difficult trade-off: either they process a large number of frames, which significantly increases the number of tokens and computational resources required, or they reduce the number of tokens per frame, which inevitably loses fine-grained detail. To solve this, the Apple researchers developed a model family called SlowFast-LLaVA-1.5, which uses a two-stream mechanism to analyze video content in a more balanced and efficient way.

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously. The Slow pathway is designed to capture detailed spatial features—the “what” of the video. It operates at a low frame rate, analyzing fewer frames but in high resolution to understand the objects and semantics within the scene. In contrast, the Fast pathway is designed to capture motion cues—the “how” of the video. It operates at a high frame rate, processing many frames but with fewer tokens per frame, allowing it to focus on movement and long-range temporal context without a massive computational load. These two streams are then combined to give the language model a comprehensive yet token-efficient understanding of the video.
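To make the two-pathway idea concrete, here is a minimal Python sketch of how such a token budget could be assembled. It is an illustration, not the paper's implementation: the function names, the use of average pooling for the Fast pathway, the frame counts, the grid size, and the choice to combine the streams by simple concatenation are all assumptions made for this example. Each float stands in for a full token embedding.

```python
from typing import List

# One frame's token grid; a single float stands in for a token embedding.
Grid = List[List[float]]


def uniform_indices(total: int, count: int) -> List[int]:
    """Pick `count` evenly spaced frame indices out of `total` frames."""
    step = total / count
    return [int(i * step) for i in range(count)]


def avg_pool(grid: Grid, stride: int) -> Grid:
    """Average non-overlapping stride x stride blocks to shrink a token grid."""
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows, stride):
        row = []
        for c in range(0, cols, stride):
            block = [
                grid[i][j]
                for i in range(r, min(r + stride, rows))
                for j in range(c, min(c + stride, cols))
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def flatten(grid: Grid) -> List[float]:
    """Turn a 2D token grid into a flat token sequence."""
    return [tok for row in grid for tok in row]


def slowfast_tokens(frames: List[Grid], slow_frames: int, fast_stride: int) -> List[float]:
    """Hypothetical SlowFast-style token builder (assumed design, see lead-in).

    Slow pathway: few frames, every token kept (spatial detail).
    Fast pathway: every frame, heavily pooled (motion over time).
    """
    slow = [
        tok
        for i in uniform_indices(len(frames), slow_frames)
        for tok in flatten(frames[i])
    ]
    fast = [tok for f in frames for tok in flatten(avg_pool(f, fast_stride))]
    # Assumption: the two streams are combined by concatenation.
    return slow + fast


# 16 frames, each an 8x8 token grid (illustrative sizes only).
frames = [[[float(t)] * 8 for _ in range(8)] for t in range(16)]
tokens = slowfast_tokens(frames, slow_frames=4, fast_stride=4)
full_cost = len(frames) * 8 * 8
print(len(tokens), "tokens instead of", full_cost)
```

With these made-up sizes, the Slow pathway contributes 4 × 64 = 256 tokens and the Fast pathway 16 × 4 = 64, so the language model would see 320 tokens instead of the 1,024 needed to keep every token of every frame.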

A key focus of the research was developing smaller, more efficient models that could potentially be deployed on edge devices. The paper highlights the performance of its 1B and 3B parameter models, demonstrating that even these relatively small models can achieve state-of-the-art results. For instance, the SF-LLaVA-1.5-1B model surpassed a larger competitor, Qwen2-VL-2B, across multiple benchmarks. Similarly, the 3B model outperformed its competitor, Apollo-3B, on both general video and temporal reasoning tasks.

The larger 7B model also set new state-of-the-art scores on long-form video understanding benchmarks, achieving 62.5% on LongVideoBench and 71.5% on MLVU. The efficiency of the SlowFast mechanism was a primary driver of this performance. In a direct comparison, the Apple model processed twice as many frames (128) as a competing model while using only about 65% of the input tokens (9K vs. 14K), yet it achieved better results across nearly all benchmarks.
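The arithmetic behind that comparison can be checked in a few lines of Python. This is a rough sketch that treats the article's rounded figures (9K and 14K tokens) as exact, and it assumes the competing model used 64 frames, which follows from "twice as many frames (128)" but is not stated directly. The helper names are hypothetical.

```python
def token_ratio(ours: int, theirs: int) -> float:
    """Fraction of the competitor's input tokens that our model uses."""
    return ours / theirs


def avg_tokens_per_frame(total_tokens: int, num_frames: int) -> float:
    """Average token budget spent on each frame."""
    return total_tokens / num_frames


# Rounded figures from the article; 64 frames for the competitor is inferred.
sf_tokens, sf_frames = 9_000, 128
other_tokens, other_frames = 14_000, 64

print(f"token ratio: {token_ratio(sf_tokens, other_tokens):.1%}")
print(f"SlowFast-LLaVA-1.5: {avg_tokens_per_frame(sf_tokens, sf_frames):.1f} tokens/frame")
print(f"competitor: {avg_tokens_per_frame(other_tokens, other_frames):.1f} tokens/frame")
```

On these rounded inputs the ratio comes out near 64%, in line with the "about 65%" quoted above, and the average per-frame budget is roughly 70 tokens versus roughly 219, which is where the savings come from.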

In addition to performance, the researchers emphasized reproducibility. Unlike many state-of-the-art models that rely on large, internal datasets, SlowFast-LLaVA-1.5 was trained using a streamlined two-stage pipeline and exclusively on publicly available datasets. The first stage of training uses only images to give the model a strong foundation in general knowledge and reasoning. The second stage performs joint video-image training to learn temporal features while maintaining strong performance on still images. An ablation study confirmed the effectiveness of the combined SlowFast approach, showing that it outperforms models using only a Slow or Fast pathway individually. The study also demonstrated that the joint video-image training was a key factor in improving the model’s capabilities on both modalities.

Tags: AI, Apple


COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
