Researchers from Apple and Tel-Aviv University have developed a method that speeds up AI-based text-to-speech generation without sacrificing speech intelligibility.
The research, detailed in a paper titled “Principled Coarse-Grained Acceptance for Speculative Decoding in Speech,” focuses on autoregressive text-to-speech models. These models generate speech tokens sequentially.
Autoregressive models predict the next token based on preceding tokens. In speech generation, these tokens represent audio chunks. The researchers identified a bottleneck in existing ways of accelerating autoregressive speech generation. In speculative decoding, where a small draft model proposes tokens for a larger model to verify, requiring an exact token match in speech Large Language Models (LLMs) can be overly restrictive. Many discrete tokens are acoustically or semantically interchangeable, so valid proposals get rejected, which reduces acceptance rates and limits speedups.
Their solution, called Principled Coarse-Graining (PCG), groups speech tokens that produce similar sounds. This creates a more flexible verification step. PCG employs two models: a smaller model proposes speech tokens quickly, while a larger “judge” model verifies if these tokens belong to the correct acoustic group before acceptance.
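The difference between token-level and group-level verification can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: it assumes the judge compares the probability mass each model assigns to a token's whole acoustic group, using the usual speculative-decoding acceptance ratio. The four-token vocabulary, the group assignments, and the function names are all hypothetical.

```python
from collections import defaultdict

def group_mass(probs, group_of):
    """Sum token probabilities within each acoustic group (hypothetical grouping)."""
    mass = defaultdict(float)
    for token, p in enumerate(probs):
        mass[group_of[token]] += p
    return mass

def accept_prob_token(p_target, q_draft, token):
    """Standard speculative decoding: compare the two models on the exact token."""
    return min(1.0, p_target[token] / q_draft[token])

def accept_prob_group(p_target, q_draft, group_of, token):
    """Coarse-grained variant (assumed): compare the models on the token's group."""
    g = group_of[token]
    return min(1.0, group_mass(p_target, group_of)[g] / group_mass(q_draft, group_of)[g])

# Toy vocabulary of four tokens; tokens 0 and 1 are assumed to sound alike.
group_of = [0, 0, 1, 1]
q_draft = [0.5, 0.25, 0.125, 0.125]   # small model prefers token 0
p_target = [0.25, 0.5, 0.125, 0.125]  # large model prefers token 1, same group

token = 0  # the draft model proposes token 0
token_level = accept_prob_token(p_target, q_draft, token)
group_level = accept_prob_group(p_target, q_draft, group_of, token)
print(token_level, group_level)
```

In this toy case the two models disagree on the exact token but agree on the group, so the proposal is accepted half the time under exact matching and every time under group-level matching.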
This framework adapts speculative decoding (SD) concepts to LLMs that generate acoustic tokens, accelerating speech generation while preserving intelligibility. PCG increased speech generation speed by approximately 40%, whereas standard speculative decoding barely improved speed in speech models.
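Why acceptance rate matters so much for speed can be seen from the standard speculative-decoding estimate of how many tokens one pass of the large model yields. The sketch below assumes each drafted token is accepted independently with the same probability; the acceptance rates and draft length are illustrative values, not figures from the paper, and the function name is hypothetical.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens produced per large-model pass.

    Uses the standard estimate (1 - a**(k + 1)) / (1 - a) for acceptance
    rate a and k drafted tokens, assuming independent acceptances.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)  # every draft accepted, plus one bonus token
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

# Illustrative acceptance rates with three drafted tokens per pass.
for rate in (0.25, 0.5, 0.75):
    print(rate, expected_tokens_per_pass(rate, 3))
```

Tokens per pass is only part of the picture, since the draft model's own cost also counts toward wall-clock time, but it shows why a low acceptance rate leaves little room for speedup.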
The approach also achieved lower word error rates than prior speed-focused methods. It preserved speaker similarity and scored 4.09 for naturalness on a 1–5 human rating scale, again ahead of those methods.
In one stress test, 91.4% of speech tokens were replaced with alternatives from the same acoustic group. Word error rate rose by only 0.007, and speaker similarity dropped by only 0.027.
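A substitution test of this kind could look roughly like the following. This is a hedged sketch rather than the researchers' procedure: the token IDs, the group labels, and the rule of swapping every token that has at least one alternative in its group are all assumptions made for illustration.

```python
import random

def substitute_within_group(tokens, group_of, members, rng):
    """Swap each token for a different member of its group when one exists.

    Returns the new sequence and the fraction of tokens that were replaced.
    """
    out = []
    replaced = 0
    for t in tokens:
        alternatives = [m for m in members[group_of[t]] if m != t]
        if alternatives:
            out.append(rng.choice(alternatives))
            replaced += 1
        else:
            out.append(t)  # singleton group: nothing to swap in
    return out, replaced / len(tokens)

# Hypothetical groups: tokens 0 and 1 sound alike, as do 3 and 4; token 2 is alone.
group_of = {0: "a", 1: "a", 2: "b", 3: "c", 4: "c"}
members = {"a": [0, 1], "b": [2], "c": [3, 4]}

tokens = [0, 2, 3, 1, 4]
swapped, fraction = substitute_within_group(tokens, group_of, members, random.Random(0))
print(swapped, fraction)
```

The swapped sequence would then be decoded to audio and scored for word error rate and speaker similarity against the original, which is the comparison the reported deltas describe.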
PCG is a decoding-time change, so it does not require retraining the target model and can be applied to existing speech models at inference.
The method also adds little overhead: it needs about 37 MB of memory to store the acoustic similarity groups, which makes it practical for deployment on devices with limited memory.





