Dataconomy

Apple researchers develop token-efficient AI for long-form video understanding

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously.

by Emre Çıtak
August 22, 2025
in Research

Researchers from Apple have introduced a new family of video large language models designed to stay efficient and accurate at smaller, mobile-friendly scales. The research, detailed in a paper titled “SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding,” presents an architecture that reaches state-of-the-art performance by balancing how many video frames it processes against how many tokens it spends on each frame.

The ‘SlowFast’ approach to video analysis

A primary challenge for existing video AI models is managing the immense computational cost of processing long video sequences. These models face a difficult trade-off: either they process a large number of frames, which significantly increases the number of tokens and computational resources required, or they reduce the number of tokens per frame, which inevitably loses fine-grained detail. To solve this, the Apple researchers developed a model family called SlowFast-LLaVA-1.5, which uses a two-stream mechanism to analyze video content in a more balanced and efficient way.
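The trade-off comes down to simple arithmetic: for a single-stream model, the number of visual tokens is roughly the number of frames multiplied by the tokens per frame. The sketch below is a minimal illustration and not code from the paper. The 9,000-token budget and both per-frame token counts are assumed values, chosen only to show how the two options pull in opposite directions.

```python
def frames_within_budget(token_budget: int, tokens_per_frame: int) -> int:
    """Return how many whole frames fit in a fixed visual-token budget."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    return token_budget // tokens_per_frame


# Hypothetical numbers, chosen for illustration only.
BUDGET = 9_000
detailed = frames_within_budget(BUDGET, tokens_per_frame=576)  # rich detail, few frames
compact = frames_within_budget(BUDGET, tokens_per_frame=64)    # coarse detail, many frames
print(detailed, compact)  # 15 140
```

With the same budget, the detailed setting covers only a handful of frames, while the compact setting covers many frames but keeps little information about each one.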

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously. The Slow pathway is designed to capture detailed spatial features—the “what” of the video. It operates at a low frame rate, analyzing fewer frames but in high resolution to understand the objects and semantics within the scene. In contrast, the Fast pathway is designed to capture motion cues—the “how” of the video. It operates at a high frame rate, processing many frames but with fewer tokens per frame, allowing it to focus on movement and long-range temporal context without a massive computational load. These two streams are then combined to give the language model a comprehensive yet token-efficient understanding of the video.
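The two-pathway idea can be sketched as a token count. This is a simplified illustration under stated assumptions, not Apple's implementation: the function names are hypothetical, the Slow pathway is assumed to sample frames uniformly, the Fast pathway is assumed to see every input frame, and all frame and token counts are made-up values.

```python
def uniform_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across the clip."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the centre of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def slowfast_token_count(
    num_frames: int,
    slow_frames: int,
    slow_tokens_per_frame: int,
    fast_tokens_per_frame: int,
) -> int:
    """Total visual tokens when both pathways are concatenated."""
    # Slow pathway: few frames, many tokens each (spatial detail).
    slow = len(uniform_indices(num_frames, slow_frames)) * slow_tokens_per_frame
    # Fast pathway (assumed): every frame, few tokens each (motion cues).
    fast = num_frames * fast_tokens_per_frame
    return slow + fast


# Hypothetical settings, for illustration only.
print(uniform_indices(128, 4))                 # [16, 48, 80, 112]
print(slowfast_token_count(128, 32, 144, 16))  # 6656
print(128 * 144)                               # 18432 if every frame kept full detail
```

In this toy setting, combining the two pathways covers all 128 frames for about a third of the tokens that full detail on every frame would need.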

A key focus of the research was developing smaller, more efficient models that could potentially be deployed on edge devices. The paper highlights the performance of its 1B and 3B parameter models, demonstrating that even these relatively small models can achieve state-of-the-art results. For instance, the SF-LLaVA-1.5-1B model surpassed a larger competitor, Qwen2-VL-2B, across multiple benchmarks. Similarly, the 3B model outperformed its competitor, Apollo-3B, on both general video and temporal reasoning tasks.

The larger 7B model also set new state-of-the-art scores on long-form video understanding benchmarks, achieving 62.5% on LongVideoBench and 71.5% on MLVU. The efficiency of the SlowFast mechanism was a primary driver of this performance. In a direct comparison, the Apple model processed twice as many frames (128) as a competing model while using only about 65% of the input tokens (9K vs. 14K), yet it achieved better results across nearly all benchmarks.
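The figures in that comparison can be checked with a few lines of arithmetic. The sketch uses only the rounded numbers quoted above (9K and 14K tokens, 128 frames). The 64-frame figure for the competing model is inferred from “twice as many frames” and is an assumption, so the per-frame averages are approximate.

```python
# Rounded figures quoted in the article.
sf_frames, sf_tokens = 128, 9_000
other_tokens = 14_000
# Assumption: "twice as many frames" implies the competing model saw 64.
other_frames = sf_frames // 2

token_ratio = sf_tokens / other_tokens         # share of the competitor's tokens
sf_per_frame = sf_tokens / sf_frames           # average tokens spent per frame
other_per_frame = other_tokens / other_frames

print(f"{token_ratio:.1%}")           # 64.3%
print(sf_per_frame, other_per_frame)  # 70.3125 218.75
```

On these rounded numbers, the Apple model spends roughly a third as many tokens per frame on average, which is how it can cover twice the frames with a smaller total.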

In addition to performance, the researchers emphasized reproducibility. Unlike many state-of-the-art models that rely on large, internal datasets, SlowFast-LLaVA-1.5 was trained using a streamlined two-stage pipeline and exclusively on publicly available datasets. The first stage of training uses only images to give the model a strong foundation in general knowledge and reasoning. The second stage performs joint video-image training to learn temporal features while maintaining strong performance on still images. An ablation study confirmed the effectiveness of the combined SlowFast approach, showing that it outperforms models using only a Slow or Fast pathway individually. The study also demonstrated that the joint video-image training was a key factor in improving the model’s capabilities on both modalities.

Tags: AI, Apple

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
