In a preprint paper released in late October, researchers at Goodfire.ai reported isolating distinct memorization and reasoning pathways inside AI neural networks.
The research demonstrates a clear separation of these functions within large language models. When memorization pathways were removed, models lost 97 percent of their ability to recite verbatim training data. Their “logical reasoning” ability, however, remained largely intact.
The researchers ranked weight components from high to low based on “curvature.” In layer 22 of the Allen Institute for AI’s OLMo-7B language model, the bottom 50 percent of weight components showed 23 percent higher activation on memorized data, while the top 10 percent showed 26 percent higher activation on general, non-memorized text.
This mechanistic split allowed for surgical removal of memorization while preserving other capabilities: deleting the bottom-ranked, low-curvature components wiped out memorization, while the retained top-ranked components continued to handle problem-solving.
Arithmetic operations appear to share neural pathways with memorization rather than logical reasoning. Removing memorization circuits caused mathematical performance to plummet to 66 percent of baseline, while logical tasks remained nearly untouched. This may explain why AI models struggle with math without external tools: they appear to rely on memorized facts like “2+2=4” rather than performing computation.
AI “reasoning” encompasses abilities like evaluating true/false statements and following if-then rules, which survived memory removal. This differs from the deeper “mathematical reasoning” needed for proofs or novel problem-solving, which current AI models struggle with even when their pattern-matching abilities are intact.
Future development of these information removal techniques could enable AI companies to remove copyrighted content, private information, or harmful memorized text from neural networks without destroying transformative task performance. However, researchers state their method “cannot guarantee complete elimination of sensitive information” due to the distributed nature of information storage in neural networks.
Understanding this distinction involves the “loss landscape,” a visualization of an AI model’s prediction accuracy based on internal settings or “weights.” “Loss” measures errors, with low loss indicating few errors. The “landscape” maps error rates for all possible setting combinations. During training, AI models adjust weights to minimize errors, effectively “rolling downhill” in this landscape.
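To make that picture concrete, here is a toy sketch, not from the paper, of a two-weight “model” rolling downhill on an invented loss surface via gradient descent; the loss function, starting weights, and learning rate are all made up for illustration.

```python
import numpy as np

# Invented two-weight "loss landscape": error as a function of the weights.
def loss(w):
    w1, w2 = w
    return (w1 - 1.0) ** 2 + 0.1 * (w2 + 2.0) ** 2

# Gradient of that loss: it points uphill, so training steps the opposite way.
def grad(w):
    w1, w2 = w
    return np.array([2.0 * (w1 - 1.0), 0.2 * (w2 + 2.0)])

w = np.array([4.0, 3.0])       # arbitrary starting weights
for _ in range(500):
    w = w - 0.1 * grad(w)      # "rolling downhill": adjust weights to reduce error

print(f"weights: {w}, loss: {loss(w):.6f}")   # ends near (1, -2), the low-error valley
```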
Researchers analyzed the “curvature” of loss landscapes, measuring the sensitivity of model performance to small changes in neural network weights. High curvature indicates sharp peaks and valleys, meaning small changes have significant effects. Low curvature signifies flat plains where changes have minimal impact. These curvature values were used to rank weight components.
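As a rough illustration of what “curvature” means here (again a toy example rather than the paper’s code), the matrix of second derivatives of a loss, its Hessian, reveals which weight directions are sharp and which are flat:

```python
import numpy as np

# Invented quadratic loss with one sharp and one flat weight direction.
def loss(w):
    w1, w2 = w
    return 50.0 * w1 ** 2 + 0.01 * w2 ** 2

# Estimate the Hessian (second derivatives) by central finite differences.
def hessian(f, w, eps=1e-4):
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps ** 2)
    return H

curvatures, directions = np.linalg.eigh(hessian(loss, np.zeros(2)))
# Large eigenvalue = sharp direction (small weight changes have big effects);
# small eigenvalue = flat direction (changes barely matter).
for c, d in zip(curvatures, directions.T):
    print(f"curvature {c:8.2f} along direction {d}")
```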
Using K-FAC (Kronecker-Factored Approximate Curvature), the scientists found that individual memorized facts create sharp, idiosyncratic spikes in the landscape that flatten out when averaged across many examples. In contrast, reasoning abilities, which many different inputs rely on, maintain consistent, moderate curvature.
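The Kronecker-factored idea itself can be sketched briefly: for a single linear layer, K-FAC approximates the curvature (Fisher information) matrix as the Kronecker product of two small covariance matrices, one over the layer’s inputs and one over the gradients flowing back through its outputs. The sketch below uses made-up data and none of the paper’s actual machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake statistics for one linear layer (sizes and data invented for illustration).
n, d_in, d_out = 2000, 64, 32
a = rng.normal(size=(n, d_in))      # layer inputs (activations), one row per example
g = rng.normal(size=(n, d_out))     # gradients at the layer's outputs, one row per example

# K-FAC's two Kronecker factors: input covariance A and output-gradient covariance G.
A = a.T @ a / n                      # shape (d_in, d_in)
G = g.T @ g / n                      # shape (d_out, d_out)

# The eigenvalues of the full curvature approximation A (kron) G are products of
# A's and G's eigenvalues, which is what makes ranking directions by curvature cheap.
eig_A = np.linalg.eigvalsh(A)
eig_G = np.linalg.eigvalsh(G)
approx_curvature = np.outer(eig_G, eig_A).ravel()

print("flattest direction:", approx_curvature.min())
print("sharpest direction:", approx_curvature.max())
```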
Researchers indicate that “directions that implement shared mechanisms used by many inputs add coherently and remain high-curvature on average,” describing reasoning pathways. Memorization, conversely, uses “idiosyncratic sharp directions associated with specific examples” that appear flat when averaged.
The technique was tested on multiple AI systems, including the Allen Institute’s OLMo-2 family (7 billion- and 1 billion-parameter versions) and custom 86 million-parameter Vision Transformers (ViT-Base models) trained on ImageNet. The team also validated its findings against existing methods such as BalancedSubnet.
Selectively removing low-curvature weight components resulted in memorized content recall dropping to 3.4 percent from nearly 100 percent. Logical reasoning tasks maintained 95 to 106 percent of baseline performance.
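The removal step can be pictured schematically as projecting a layer’s weights onto only the high-curvature directions; in the sketch below, the weight matrix, the per-direction curvature scores, and the median threshold are all invented stand-ins, whereas the paper’s actual procedure derives its directions and rankings from K-FAC statistics.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in weight matrix for one layer, plus made-up per-direction curvature scores.
d_out, d_in = 32, 64
W = rng.normal(size=(d_out, d_in))
directions = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]   # orthonormal basis (stand-in)
curvature = rng.exponential(size=d_in)                         # pretend curvature-derived scores

# Keep only the top half of directions by curvature; project the rest out of W.
keep = curvature >= np.median(curvature)
P_keep = directions[:, keep] @ directions[:, keep].T           # projector onto kept subspace
W_edited = W @ P_keep                                          # low-curvature components removed

print(f"removed {int((~keep).sum())} of {d_in} directions; "
      f"weight norm {np.linalg.norm(W):.2f} -> {np.linalg.norm(W_edited):.2f}")
```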
Logical tasks included Boolean expression evaluation, logical deduction puzzles, object tracking, BoolQ for yes/no reasoning, Winogrande for common sense inference, and OpenBookQA for science questions. Mathematical operations and closed-book fact retrieval, which share pathways with memorization, dropped to 66 to 86 percent of baseline performance after editing. Arithmetic proved particularly brittle: after low-curvature components were removed, models produced the same reasoning chains but failed at the calculations themselves.
The team suggested this is either because “[a]rithmetic problems themselves are memorized at the 7B scale,” or because “they require narrowly used directions to do precise calculations.” Open-book question answering, which relies on provided context, maintained nearly full performance.
The separation also varied by information type: common facts, such as country capitals, showed minimal change after editing, while rare facts, such as company CEOs, dropped 78 percent, suggesting the model allocates neural resources differently depending on how frequently information appears in training data.
The K-FAC technique outperformed existing memorization-removal methods, achieving 16.1 percent memorization on unseen historical quotes versus 60 percent for BalancedSubnet. Vision Transformers showed a similar pattern: removing memorization pathways restored 66.5 percent accuracy on previously mislabeled images.
The researchers acknowledge several limitations. Removed memories might return with further training, since current unlearning methods primarily suppress information rather than erase it. It also remains unclear why math is so fragile when memorization is removed, and whether some complex capabilities are being misidentified as memorization. Additionally, the mathematical tools used to measure the model’s “landscape” can be unreliable at extremes.





