Dataconomy
Apple researchers develop token-efficient AI for long-form video understanding

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously.

by Emre Çıtak
August 22, 2025
in Research

Researchers from Apple have introduced a new family of video large language models that are both highly efficient and powerful, particularly at smaller, mobile-friendly scales. The new research, detailed in a paper titled SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding, presents a new architecture that achieves state-of-the-art performance by intelligently balancing how it processes video frames.

The ‘SlowFast’ approach to video analysis

A primary challenge for existing video AI models is managing the immense computational cost of processing long video sequences. These models face a difficult trade-off: either they process a large number of frames, which significantly increases the number of tokens and computational resources required, or they reduce the number of tokens per frame, which inevitably loses fine-grained detail. To solve this, the Apple researchers developed a model family called SlowFast-LLaVA-1.5, which uses a two-stream mechanism to analyze video content in a more balanced and efficient way.

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously. The Slow pathway is designed to capture detailed spatial features—the “what” of the video. It operates at a low frame rate, analyzing fewer frames but in high resolution to understand the objects and semantics within the scene. In contrast, the Fast pathway is designed to capture motion cues—the “how” of the video. It operates at a high frame rate, processing many frames but with fewer tokens per frame, allowing it to focus on movement and long-range temporal context without a massive computational load. These two streams are then combined to give the language model a comprehensive yet token-efficient understanding of the video.
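The two-pathway idea can be illustrated with a toy token-aggregation sketch (using NumPy as a stand-in; the function and parameter names here are hypothetical, not Apple's implementation). The slow branch keeps every token from a sparse subset of frames, while the fast branch keeps every frame but pools its tokens, and the two token sequences are concatenated for the language model:

```python
import numpy as np

def slowfast_tokens(frames, slow_stride=8, fast_pool=4, tokens_per_frame=196, dim=32):
    """Toy SlowFast aggregation. `frames` has shape (T, tokens_per_frame, dim).

    Slow pathway: every `slow_stride`-th frame at full token resolution
    (detailed spatial features, the "what").
    Fast pathway: every frame, with tokens average-pooled by `fast_pool`
    (cheap motion and temporal cues, the "how").
    """
    slow = frames[::slow_stride].reshape(-1, dim)          # sparse frames, all tokens

    pooled = frames.reshape(frames.shape[0], tokens_per_frame // fast_pool, fast_pool, dim)
    fast = pooled.mean(axis=2).reshape(-1, dim)            # all frames, pooled tokens

    return np.concatenate([slow, fast], axis=0)            # joint sequence for the LLM

rng = np.random.default_rng(0)
frames = rng.standard_normal((128, 196, 32))               # 128 frames, 196 tokens each
tokens = slowfast_tokens(frames)
print(tokens.shape)                                        # (9408, 32)
```

With these illustrative settings, 128 frames that would naively cost 128 × 196 = 25,088 tokens are reduced to 9,408, in the same ballpark as the ~9K token budget reported in the paper.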


A key focus of the research was developing smaller, more efficient models that could potentially be deployed on edge devices. The paper highlights the performance of its 1B- and 3B-parameter models, demonstrating that even these relatively small models can achieve state-of-the-art results. For instance, the SF-LLaVA-1.5-1B model surpassed a larger competitor, Qwen2-VL-2B, across multiple benchmarks. Similarly, the 3B model outperformed its competitor, Apollo-3B, on both general video and temporal reasoning tasks.

The larger 7B model also set new state-of-the-art scores on long-form video understanding benchmarks, achieving 62.5% on LongVideoBench and 71.5% on MLVU. The efficiency of the SlowFast mechanism was a primary driver of this performance. In a direct comparison, the Apple model processed twice as many frames (128) as a competing model while using only about 65% of the input tokens (9K vs. 14K), yet it achieved better results across nearly all benchmarks.
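The reported figures imply a sharp per-frame token saving. A quick back-of-the-envelope check (using the article's rounded numbers; the competitor's 64-frame count follows from "twice as many frames"):

```python
# Approximate token budgets from the reported comparison.
sf_frames, sf_tokens = 128, 9_000        # SlowFast-LLaVA-1.5
rival_frames, rival_tokens = 64, 14_000  # competing model: half the frames

print(sf_tokens / sf_frames)        # ~70 tokens per frame on average
print(rival_tokens / rival_frames)  # ~219 tokens per frame
print(sf_tokens / rival_tokens)     # ~0.64, i.e. about 65% of the input tokens
```

So the SlowFast model spends roughly a third as many tokens per frame as the competitor while seeing twice the temporal context.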

In addition to performance, the researchers emphasized reproducibility. Unlike many state-of-the-art models that rely on large, internal datasets, SlowFast-LLaVA-1.5 was trained using a streamlined two-stage pipeline and exclusively on publicly available datasets. The first stage of training uses only images to give the model a strong foundation in general knowledge and reasoning. The second stage performs joint video-image training to learn temporal features while maintaining strong performance on still images. An ablation study confirmed the effectiveness of the combined SlowFast approach, showing that it outperforms models using only a Slow or Fast pathway individually. The study also demonstrated that the joint video-image training was a key factor in improving the model’s capabilities on both modalities.
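The two-stage schedule described above can be sketched as a simple batch ordering (the batch labels and interleaving here are illustrative stand-ins, not Apple's actual training recipe):

```python
def two_stage_schedule(image_batches, video_batches):
    """Toy version of the streamlined two-stage pipeline.

    Stage 1: image-only batches build general knowledge and reasoning.
    Stage 2: joint video + image batches add temporal features while
    preserving still-image performance.
    """
    stage1 = [("stage1", b) for b in image_batches]
    # Interleave video and image batches for the joint stage.
    stage2 = [("stage2", b) for pair in zip(video_batches, image_batches) for b in pair]
    return stage1 + stage2

sched = two_stage_schedule(["img0", "img1"], ["vid0", "vid1"])
print(sched)
```

The key design point the paper stresses is that stage 2 keeps images in the mix rather than fine-tuning on video alone, which is what preserves still-image capability.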

Tags: AI, Apple



COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.