Large language models struggle with multi-digit multiplication unless they are trained with specialized methods, despite handling complex coding and reasoning tasks, according to a recent study.
Research published on the arXiv preprint server by the University of Chicago’s Xiaoyan Bai and Chenhao Tan, along with collaborators from MIT, Harvard University, the University of Waterloo, and Google DeepMind, identified the reasons for this limitation and found solutions.
Standard large language models achieved less than 1% accuracy when multiplying two four-digit numbers, even when the number of layers was increased to 12. These models converged on a "local optimum," failing to store and retrieve the intermediate computations that multi-digit multiplication requires, a problem the authors characterize as one of long-range dependencies.
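To make those dependencies concrete, consider the ordinary schoolbook algorithm (an illustration of the arithmetic itself, not code from the paper): each output digit depends on several digit-pair products plus a carry propagated from earlier positions, and all of those intermediate values must be held somewhere until they are needed.

```python
# Illustrative sketch (not from the paper): schoolbook multiplication of two
# multi-digit numbers, written to expose the intermediate values a model
# would have to store and retrieve internally.

def schoolbook_multiply(a: int, b: int) -> int:
    a_digits = [int(d) for d in str(a)][::-1]  # least-significant digit first
    b_digits = [int(d) for d in str(b)][::-1]
    result = [0] * (len(a_digits) + len(b_digits))

    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            # Each output position combines a digit-pair product, the partial
            # sum already stored at that position, and a carry from the
            # previous position: exactly the long-range intermediate values
            # the study points to.
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b_digits)] += carry

    return int("".join(str(d) for d in reversed(result)))

assert schoolbook_multiply(1234, 5678) == 1234 * 5678
```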
Conversely, a model trained with the Implicit Chain of Thought (ICoT) method achieved 100% accuracy. By gradually removing intermediate reasoning steps during training, ICoT led the model to internalize the reasoning process and track long-range dependencies. The research team could decode intermediate values, such as running sums, from the ICoT model's internal states, which was not possible with the standard fine-tuned model.
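The general idea behind that training recipe can be sketched in a few lines (the data format and removal schedule below are hypothetical, not the authors' implementation): the explicit reasoning steps are truncated a little more each epoch, so the model is pushed to carry them out internally.

```python
# Minimal sketch of the idea behind Implicit Chain of Thought (ICoT) training.
# The target format and schedule are illustrative assumptions; the key point
# is that explicit intermediate steps are removed gradually during training.

def icot_target(question: str, cot_steps: list[str], answer: str, epoch: int,
                steps_removed_per_epoch: int = 1) -> str:
    """Build the training target, dropping the first k reasoning steps."""
    k = min(epoch * steps_removed_per_epoch, len(cot_steps))
    remaining = cot_steps[k:]  # earliest steps are removed first
    return " ".join([question, *remaining, "=>", answer])

steps = ["4 * 4 = 16", "30 * 4 = 120", "16 + 120 = 136"]
for epoch in range(4):
    print(icot_target("34 * 4 ?", steps, "136", epoch))
# Epoch 0 keeps the full chain; by epoch 3 only the answer remains.
```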
The ICoT model organized its attention into distinct pathways, computing products of digit pairs in early layers and storing them in specific locations for retrieval in later layers, creating an efficient internal structure for multiplication. The study also found that the ICoT model represented operations using elegant structures, encoding digits as wave-like patterns (Fourier bases) and organizing arithmetic spatially. When multiplying digit pairs, the model naturally relied on a geometric operation called a Minkowski sum, which the researchers had not explicitly programmed.
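For readers unfamiliar with the terms, both structures can be sketched briefly (a generic illustration of the concepts, not a reconstruction of the model's internals): a Fourier basis represents a digit as wave-like (cosine, sine) features at several frequencies, and a Minkowski sum of two point sets is the set of all pairwise vector sums.

```python
# Generic illustration of the two mathematical structures mentioned above;
# the frequencies and point sets are arbitrary examples.
import math

def fourier_digit(d: int, frequencies=(1, 2, 5)) -> list[tuple[float, float]]:
    """Encode a base-10 digit as wave-like (cosine, sine) features."""
    return [(math.cos(2 * math.pi * f * d / 10),
             math.sin(2 * math.pi * f * d / 10)) for f in frequencies]

def minkowski_sum(A, B):
    """Minkowski sum of two point sets: every pairwise vector sum."""
    return {(ax + bx, ay + by) for ax, ay in A for bx, by in B}

print(fourier_digit(7))  # wave-like encoding of the digit 7
print(minkowski_sum({(0, 0), (1, 0)}, {(0, 0), (0, 1)}))  # corners of a unit square
```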
Researchers achieved 99% accuracy in a two-layer model by introducing a modified training objective that taught the model to track running sums at each step, thereby carrying intermediate values and partial products forward. This addition enabled the model to develop mechanisms similar to ICoT’s, including storing and retrieving partial products and tracking multiple digit pairs simultaneously.
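A hedged sketch of what such a modified objective could look like in practice follows; the model shape, head names, and loss weight are illustrative assumptions rather than the authors' code. An auxiliary head predicts the running sum at each position, and its loss is added to the usual next-token loss.

```python
# Hedged sketch of an auxiliary "running sum" objective; the architecture,
# names, and loss weight are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class TinyMultiplierWithRunningSum(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128, max_sum: int = 200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # two layers
        self.lm_head = nn.Linear(d_model, vocab_size)   # next-token prediction
        self.sum_head = nn.Linear(d_model, max_sum)     # running-sum prediction

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.lm_head(h), self.sum_head(h)

def training_loss(model, tokens, next_tokens, running_sums, aux_weight=0.5):
    lm_logits, sum_logits = model(tokens)
    lm_loss = nn.functional.cross_entropy(
        lm_logits.flatten(0, 1), next_tokens.flatten())
    # Auxiliary loss: supervise the running sum at every step so the model
    # learns to carry intermediate values and partial products forward.
    aux_loss = nn.functional.cross_entropy(
        sum_logits.flatten(0, 1), running_sums.flatten())
    return lm_loss + aux_weight * aux_loss
```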
Chenhao Tan said, “Our research is trying to chart that terrain.” The study highlights that architectural insights and training techniques can overcome obstacles that scaling alone cannot address, emphasizing the importance of built-in guidance in advancing AI capabilities.
The findings illuminate fundamental aspects of how large language models learn and “think,” with the long-range dependency problem extending beyond arithmetic to other sequential tasks in language modeling.





