Large language models (LLMs) are increasingly adept at generating computer code, promising to accelerate software development. However, this speed advantage is only beneficial if the generated code is correct, adheres to the programming language’s rules, and doesn’t lead to system crashes. A new approach developed by researchers at MIT and collaborating institutions now offers a way to automatically guide LLMs to produce text, particularly code, that is both structurally valid and semantically accurate, all while improving computational efficiency.
Balancing speed, structure, and meaning in AI-generated code
Programmers are turning to LLMs as powerful assistants capable of drafting code snippets, functions, and even entire modules in seconds. The catch? Ensuring this AI-generated code is usable. Code must rigidly follow the syntax of a specific programming language (its structure) and perform the intended task correctly (its meaning). Existing methods to enforce these constraints on LLMs often face a trade-off: they might distort the model’s intended output, thereby sacrificing accuracy, or they become too computationally intensive and slow for complex, real-world applications.
One common strategy involves generating a complete block of code and then validating it. If errors are found – a frequent occurrence – the entire process must be restarted, consuming significant time and computational resources. Another tactic is to check the output incrementally. While this can help ensure structural validity along the way, constant corrections can cause the code to drift from the user’s original intent, impacting its overall accuracy and usefulness.
“It is much easier to enforce structure than meaning,” notes João Loula, an MIT graduate student and co-lead author of a paper on this new framework. “We can quickly check whether something is in the right programming language, but to check its meaning you have to execute the code. Our work is also about dealing with these different types of information.”
Probabilistic guidance with Sequential Monte Carlo
The method, developed by an international team including researchers from MIT, the Mila-Quebec Artificial Intelligence Institute, Johns Hopkins University, Yale University, ETH Zurich, and McGill University, introduces a new way to steer LLMs. Rather than correcting errors after the fact, the architecture guides the LLM during generation, encouraging it to concentrate its effort on outputs that are most likely to be both valid and accurate. Unpromising avenues are discarded early, and this probabilistic pruning is what delivers the boost in computational efficiency.
The researchers achieve this using a powerful statistical technique called Sequential Monte Carlo (SMC). This method allows multiple parallel generation “threads” from the LLM to essentially compete with each other. The model dynamically allocates more computational resources to threads that appear more promising as they generate text.
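To make the idea concrete, here is a minimal sketch, in Python, of how SMC-style steering of text generation can work in general. The toy vocabulary and the propose_next_token and prefix_weight functions are illustrative stand-ins, not the researchers' implementation: a real system would draw proposals from an LLM and weight them against the user's constraints.

```python
# Minimal, self-contained sketch of SMC-style steering (illustrative only).
# A toy "LLM" proposes next tokens; a constraint check weights each partial
# sequence; resampling reallocates compute toward promising candidates.
import random

VOCAB = ["(", ")", "x", "+", "<eos>"]          # toy token set (assumption)

def propose_next_token(prefix):
    """Stand-in for an LLM's next-token distribution: uniform over the vocab."""
    return random.choice(VOCAB)

def prefix_weight(prefix):
    """Toy structural constraint: a prefix stays promising only while its
    parentheses are balanced so far (never more ')' than '(')."""
    depth = 0
    for tok in prefix:
        depth += {"(": 1, ")": -1}.get(tok, 0)
        if depth < 0:
            return 0.0                          # structurally doomed: prune
    return 1.0

def smc_generate(num_particles=8, max_len=12):
    particles = [[] for _ in range(num_particles)]
    for _ in range(max_len):
        # 1. Extend every unfinished particle ("thread") by one proposed token.
        particles = [p if p and p[-1] == "<eos>" else p + [propose_next_token(p)]
                     for p in particles]
        # 2. Weight each particle by how well it satisfies the constraints so far.
        weights = [prefix_weight(p) for p in particles]
        if sum(weights) == 0:
            weights = [1.0] * num_particles     # degenerate case: spread evenly
        # 3. Resample: promising particles are duplicated, doomed ones dropped.
        particles = random.choices(particles, weights=weights, k=num_particles)
    return particles

if __name__ == "__main__":
    for p in smc_generate():
        print("".join(t for t in p if t != "<eos>"))
```

The resampling step is where the "competition" happens: candidates with higher weights are copied more often, which is how the method dynamically shifts computation toward the threads that look most promising.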
“We are not trying to train an LLM to do this,” adds Vikash Mansinghka, a principal research scientist at MIT and co-senior author. “Instead, we are engineering some knowledge that an expert would have and combining it with the LLM’s knowledge, which offers a very different approach to scaling than you see in deep learning.”
The core idea is to integrate expert knowledge into the LLM’s generation process. Each potential output path is assigned a “weight” that reflects its likelihood of being structurally correct (e.g., valid Python syntax) and semantically accurate (i.e., doing what the user wants). At each step of the generation, the model focuses its computational power on the paths with higher weights, effectively pruning those that are less likely to succeed.
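As a rough illustration of that weighting, the sketch below (again an assumption, not the authors' code) scores a completed Python candidate on both axes: a structural weight from a syntax check and a semantic weight from a user-supplied test. The actual architecture applies such checks incrementally, at each step of generation, rather than only at the end, and the solve function name and toy test here are hypothetical.

```python
# Illustrative weighting of one completed Python candidate (not the authors' system).
def structural_weight(code: str) -> float:
    """1.0 if the candidate is syntactically valid Python, else 0.0."""
    try:
        compile(code, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

def semantic_weight(code: str, test_input, expected) -> float:
    """1.0 if running the candidate's `solve` function (hypothetical name)
    matches a user-supplied expected answer, else 0.0."""
    scope = {}
    try:
        exec(code, scope)
        return 1.0 if scope["solve"](test_input) == expected else 0.0
    except Exception:
        return 0.0

def candidate_weight(code, test_input, expected):
    # A path keeps weight only if it is both well-formed and does what was asked.
    return structural_weight(code) * semantic_weight(code, test_input, expected)

good = "def solve(x):\n    return x * 2\n"
bad  = "def solve(x) return x * 2\n"   # missing colon: weight collapses to 0
print(candidate_weight(good, 3, 6), candidate_weight(bad, 3, 6))  # 1.0 0.0
```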
It’s akin to having an expert programmer looking over the LLM’s shoulder, offering guidance at each decision point while keeping the overall goal in mind. The user initially specifies their desired output structure, its intended meaning, and how the system should check the output. The new architecture then guides the LLM to fulfill these requirements efficiently.
“We’ve worked out the hard math so that, for any kinds of constraints you’d like to incorporate, you are going to get the proper weights,” Loula explains. “In the end, you get the right answer.” This sophisticated control ensures that the LLM doesn’t just produce plausible-sounding text, but text that is genuinely useful and correct within the specified constraints.
Putting it to the test
The efficacy of this framework was demonstrated across several challenging, real-world use cases. The researchers tasked LLMs with generating four distinct types of outputs:
- Python computer code
- SQL database queries
- Molecular structures
- Sequential plans for a robot to follow
When compared to existing approaches for controlling LLM outputs, the new method consistently achieved higher accuracy while demanding less computation. One of the most striking results came from Python code generation: the researchers' architecture enabled a relatively small, open-source LLM to outperform a specialized, commercial closed-source model more than double its size at generating accurate, properly structured code.
“We are very excited that we can allow these small models to punch way above their weight,” Loula states, highlighting the efficiency and power unlocked by their approach.
The impact of this research extends far beyond making programmers’ lives easier. In the long run, this architecture could democratize access to complex AI-generated content. For example, business professionals with no coding expertise could potentially write complex queries in SQL (a database manipulation language) using only natural language prompts, with the system ensuring the SQL generated is both valid and accurately reflects their request.
“This work has implications beyond research. It could improve programming assistants, AI-powered data analysis, and scientific discovery tools by ensuring that AI-generated outputs remain both useful and correct,” says Loula. Mansinghka adds that the approach could enable machine-assisted data analysis systems where users can converse with software that accurately models the meaning of data and the questions being asked.
Timothy J. O’Donnell, an associate professor at McGill University who led the international team, also points to deeper connections: “One of the fundamental questions of linguistics is how the meaning of words, phrases, and sentences can be grounded in models of the world… LLMs, predicting likely token sequences, don’t address this problem. Our paper shows that, in narrow symbolic domains, it is technically possible to map from words to distributions on grounded meanings. It’s a small step towards deeper questions in cognitive science, linguistics, and artificial intelligence needed to understand how machines can communicate about the world like we do.”