Apple Researchers Develop Token-Efficient AI for Long-Form Video Understanding

Researchers from Apple have introduced a family of video large language models that are both efficient and powerful, particularly at smaller, mobile-friendly scales. The research, detailed in a paper titled “SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding,” presents an architecture that achieves state-of-the-art performance by balancing how many video frames it processes against how many tokens it spends on each frame.

The ‘SlowFast’ approach to video analysis

A primary challenge for existing video AI models is managing the immense computational cost of processing long video sequences. These models face a difficult trade-off: either they process a large number of frames, which significantly increases the number of tokens and computational resources required, or they reduce the number of tokens per frame, which inevitably loses fine-grained detail. To solve this, the Apple researchers developed a model family called SlowFast-LLaVA-1.5, which uses a two-stream mechanism to analyze video content in a more balanced and efficient way.

The core of this innovation is the SlowFast mechanism, which processes video through two parallel pathways simultaneously. The Slow pathway is designed to capture detailed spatial features—the “what” of the video. It operates at a low frame rate, analyzing fewer frames but in high resolution to understand the objects and semantics within the scene. In contrast, the Fast pathway is designed to capture motion cues—the “how” of the video. It operates at a high frame rate, processing many frames but with fewer tokens per frame, allowing it to focus on movement and long-range temporal context without a massive computational load. These two streams are then combined to give the language model a comprehensive yet token-efficient understanding of the video.
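
To make the two-pathway idea concrete, here is a minimal NumPy sketch of how such a token budget could be assembled. It is an illustration under stated assumptions, not the paper's implementation: the function names, the uniform frame sampling, the average pooling, and every size used (32 frames, an 8×8 token grid, 8 Slow frames, a pooling factor of 4) are hypothetical choices made for this example.

```python
import numpy as np


def spatial_avg_pool(frames: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (T, H, W, D) grid of frame features by `factor` along H and W."""
    t, h, w, d = frames.shape
    assert h % factor == 0 and w % factor == 0, "grid must divide evenly"
    blocks = frames.reshape(t, h // factor, factor, w // factor, factor, d)
    return blocks.mean(axis=(2, 4))


def slowfast_tokens(frames: np.ndarray, num_slow: int, fast_pool: int) -> np.ndarray:
    """Build one token sequence from a Slow and a Fast view of the same clip.

    Slow pathway: a few uniformly sampled frames, full spatial detail.
    Fast pathway: every frame, heavily pooled so each frame costs few tokens.
    (Hypothetical helper; the sampling and pooling choices are assumptions.)
    """
    t, h, w, d = frames.shape
    slow_idx = np.linspace(0, t - 1, num_slow).round().astype(int)
    slow = frames[slow_idx].reshape(-1, d)                      # num_slow * H * W tokens
    fast = spatial_avg_pool(frames, fast_pool).reshape(-1, d)   # T * (H/p) * (W/p) tokens
    return np.concatenate([slow, fast], axis=0)


# Stand-in for vision-encoder output: 32 frames, 8x8 tokens per frame, 16-dim features.
features = np.random.default_rng(0).normal(size=(32, 8, 8, 16))
tokens = slowfast_tokens(features, num_slow=8, fast_pool=4)

full_cost = features.shape[0] * features.shape[1] * features.shape[2]
print(tokens.shape[0], "tokens instead of", full_cost)  # 640 tokens instead of 2048
```

In this toy setup the Slow pathway contributes 8 × 64 = 512 tokens and the Fast pathway 32 × 4 = 128, so the language model sees every frame at least coarsely while receiving less than a third of the tokens that full-resolution processing of all frames would require.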

A key focus of the research was developing smaller, more efficient models that could potentially be deployed on edge devices. The paper highlights the performance of its 1B and 3B parameter models, demonstrating that even these relatively small models can achieve state-of-the-art results. For instance, the 1B model, SF-LLaVA-1.5-1B, surpassed a larger competitor, Qwen2-VL-2B, across multiple benchmarks. Similarly, the 3B model outperformed a competitor of the same size, Apollo-3B, on both general video and temporal reasoning tasks.

The larger 7B model also set new state-of-the-art scores on long-form video understanding benchmarks, achieving 62.5% on LongVideoBench and 71.5% on MLVU. The efficiency of the SlowFast mechanism was a primary driver of this performance. In a direct comparison, the Apple model processed twice as many frames (128) as a competing model while using only about 65% of the input tokens (9K vs. 14K), yet it achieved better results across nearly all benchmarks.
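
The efficiency comparison can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures quoted above; the competitor's frame count of 64 is inferred from “twice as many frames,” and the per-frame averages are rough estimates derived from those rounded numbers, not values reported in the paper.

```python
# Rounded figures from the comparison above.
sf_frames, sf_total_tokens = 128, 9_000
other_frames, other_total_tokens = 64, 14_000  # 64 is inferred, not quoted directly

token_ratio = sf_total_tokens / other_total_tokens
sf_per_frame = sf_total_tokens / sf_frames
other_per_frame = other_total_tokens / other_frames

print(f"token budget relative to competitor: {token_ratio:.0%}")
print(f"average tokens per frame: {sf_per_frame:.0f} vs. {other_per_frame:.0f}")
```

On these rounded inputs the ratio comes out at roughly 64%, consistent with the “about 65%” figure, and the average cost per frame is about a third of the competitor's.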

In addition to performance, the researchers emphasized reproducibility. Unlike many state-of-the-art models that rely on large, internal datasets, SlowFast-LLaVA-1.5 was trained using a streamlined two-stage pipeline and exclusively on publicly available datasets. The first stage of training uses only images to give the model a strong foundation in general knowledge and reasoning. The second stage performs joint video-image training to learn temporal features while maintaining strong performance on still images. An ablation study confirmed the effectiveness of the combined SlowFast approach, showing that it outperforms models using only a Slow or Fast pathway individually. The study also demonstrated that the joint video-image training was a key factor in improving the model’s capabilities on both modalities.