Apple researchers published a study detailing how large language models (LLMs) can interpret audio and motion data to identify user activities, focusing on late multimodal sensor fusion for activity recognition.
The paper, titled “Using LLMs for Late Multimodal Sensor Fusion for Activity Recognition,” by Ilker Demirel, Karan Ketankumar Thakkar, Benjamin Elizalde, Miquel Espi Marques, Shirley Ren, and Jaya Narain, was accepted at the Learning from Time Series for Health workshop at NeurIPS 2025. The research explores using LLM reasoning on top of the outputs of conventional sensor models to improve activity classification.
The researchers state, “Sensor data streams provide valuable information around activities and context for downstream applications, though integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data.” They curated a subset of data for diverse activity recognition from the Ego4D dataset, encompassing household activities and sports.
The evaluated LLMs achieved 12-class zero- and one-shot classification F1-scores well above chance, with no task-specific training. Zero-shot classification via LLM-based fusion of modality-specific model outputs makes multimodal temporal applications feasible even when there is too little aligned training data to learn a shared embedding space, and it lets models be deployed without the additional memory and compute of dedicated application-specific multimodal models.
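For a sense of what “above chance” means in a 12-class setting, the short sketch below (illustrative only, not from the paper; the sample counts and random labels are made up) estimates the chance-level macro F1-score of a uniformly random predictor with scikit-learn, which lands near 1/12 ≈ 0.083.

```python
# Illustrative only: estimates chance-level macro F1 for a 12-class problem.
# Sample counts and labels are placeholders, not the paper's data.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_classes, n_samples = 12, 10_000

y_true = rng.integers(0, n_classes, size=n_samples)  # balanced ground truth
y_pred = rng.integers(0, n_classes, size=n_samples)  # uniformly random guesses

chance_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Chance-level macro F1 over 12 classes: {chance_f1:.3f}")  # ~0.083
```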
The study highlights LLMs’ ability to infer user activities from basic audio and motion signals, with accuracy improving further when a single example is provided. Crucially, the LLM was not directly fed raw audio. Instead, it received short text descriptions generated by audio models and an IMU-based motion model, which tracks movement via accelerometer and gyroscope data.
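As a rough illustration of this late-fusion setup, the sketch below is my own approximation rather than Apple’s actual prompts or models: the caption text, label strings, and activity list are hypothetical. It assembles the text outputs of an audio captioner and an IMU classifier into a single zero-shot classification prompt for an LLM.

```python
# Hypothetical sketch of LLM-based late fusion: the LLM never sees raw audio or
# IMU waveforms, only short text outputs from modality-specific models.
from typing import Sequence

# Placeholder activity list; the paper's twelve classes are not reproduced here.
ACTIVITIES = ["cooking", "vacuuming", "playing basketball", "weightlifting"]

def build_fusion_prompt(audio_caption: str,
                        audio_labels: Sequence[str],
                        imu_prediction: str,
                        activities: Sequence[str] = ACTIVITIES) -> str:
    """Combine modality-specific text outputs into one classification prompt."""
    return (
        "You are given outputs from an audio model and a motion (IMU) model "
        "describing the same 20-second clip.\n"
        f"Audio caption: {audio_caption}\n"
        f"Audio labels: {', '.join(audio_labels)}\n"
        f"IMU activity prediction: {imu_prediction}\n"
        f"Choose the most likely activity from: {', '.join(activities)}.\n"
        "Answer with a single activity name."
    )

prompt = build_fusion_prompt(
    audio_caption="sounds of chopping and a pan sizzling",
    audio_labels=["kitchen noise", "speech"],
    imu_prediction="standing with intermittent arm movement",
)
print(prompt)
# In the one-shot setting, a single worked example (fused inputs plus the
# correct activity) would be prepended before sending the prompt to the LLM.
```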
For the study, researchers utilized Ego4D, a dataset featuring thousands of hours of first-person perspective media. They curated a dataset of daily activities from Ego4D by searching its narrative descriptions; the curated dataset consists of 20-second samples spanning twelve high-level activities.
The activities were chosen to cover household and fitness tasks and for their prevalence in the larger Ego4D dataset. Audio and motion data were processed through smaller models to generate text captions and class predictions. These outputs were then fed into different LLMs, specifically Gemini-2.5-pro and Qwen-32B, to assess activity identification accuracy.
Apple compared model performance in two scenarios: a closed-set test where models chose from the 12 predefined activities, and an open-ended test without provided options. Various combinations of audio captions, audio labels, IMU activity prediction data, and extra context were used for each test.
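A minimal sketch of how the two evaluation modes might differ at the prompt level is shown below. The wording is assumed, not taken from Apple’s released prompts: the closed-set variant constrains the answer to the predefined options, while the open-ended variant asks for a free-form answer.

```python
# Hypothetical prompt variants for the two evaluation settings.
# FUSED_EVIDENCE stands in for the audio caption / labels / IMU prediction text.
FUSED_EVIDENCE = (
    "Audio caption: rhythmic bouncing and sneaker squeaks\n"
    "IMU activity prediction: running with frequent direction changes"
)
ACTIVITIES = ["playing basketball", "cooking", "vacuuming", "weightlifting"]  # placeholder

closed_set_prompt = (
    f"{FUSED_EVIDENCE}\n"
    f"Pick exactly one activity from this list: {', '.join(ACTIVITIES)}."
)

open_ended_prompt = (
    f"{FUSED_EVIDENCE}\n"
    "In a few words, what activity is the person most likely doing?"
)
```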
The researchers noted that the results offer insights into combining multiple models for activity and health data. This approach is particularly beneficial when raw sensor data alone is insufficient to provide a clear picture of user activity. Apple also published supplemental materials, including Ego4D segment IDs, timestamps, prompts, and one-shot examples, to facilitate reproducibility for other researchers.





