Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory developed PDDL-INSTRUCT, a framework that combines logical chain-of-thought reasoning with external plan validation to improve how large language models generate multi-step plans, achieving up to 94% plan validity on specific benchmarks.
The framework addresses a common failure of large language models (LLMs): producing plans that sound plausible but are logically invalid. PDDL-INSTRUCT counters this by integrating explicit state and action semantics with ground-truth checking. Through “error education,” models are trained to explain why plans fail, citing unsatisfied preconditions, incorrect effects, frame violations, or an unreached goal. A logical chain-of-thought (CoT) prompting method then guides the model through step-by-step inference, producing detailed state-action-state traces of the form ⟨sᵢ, aᵢ₊₁, sᵢ₊₁⟩ grounded in the formal action semantics.
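The transition check behind each ⟨sᵢ, aᵢ₊₁, sᵢ₊₁⟩ step can be sketched in plain STRIPS terms. The `Action` structure, the `check_transition` helper, and the toy `unstack` operator below are illustrative stand-ins, not the paper's code:

```python
# Minimal STRIPS-style check of one state-action-state step: preconditions
# must hold in s_i, and s_{i+1} must equal (s_i - del_effects) + add_effects.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold in s_i
    add_effects: frozenset    # facts added in s_{i+1}
    del_effects: frozenset    # facts removed in s_{i+1}

def check_transition(s_i: frozenset, action: Action, s_next: frozenset) -> list:
    """Return a list of violations; an empty list means the step is valid."""
    errors = []
    missing = action.preconditions - s_i
    if missing:
        errors.append(f"unsatisfied preconditions: {sorted(missing)}")
    expected = (s_i - action.del_effects) | action.add_effects
    if s_next != expected:
        errors.append(f"incorrect effects or frame violation: expected {sorted(expected)}")
    return errors

# Toy Blocksworld-style step: unstack block a from block b.
unstack = Action(
    name="unstack(a, b)",
    preconditions=frozenset({"on(a,b)", "clear(a)", "handempty"}),
    add_effects=frozenset({"holding(a)", "clear(b)"}),
    del_effects=frozenset({"on(a,b)", "clear(a)", "handempty"}),
)
s0 = frozenset({"on(a,b)", "ontable(b)", "clear(a)", "handempty"})
s1 = (s0 - unstack.del_effects) | unstack.add_effects
print(check_transition(s0, unstack, s1))  # → []
```

Each error category mirrors the failure modes the models are trained to explain: missing preconditions, wrong effects, and frame violations all fall out of the same two set comparisons.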
To ensure correctness, each step of a generated plan is verified by the external VAL plan validator. The system can receive either binary feedback (valid/invalid) or detailed feedback specifying which precondition or effect failed; the researchers found that detailed feedback yielded the strongest performance gains. PDDL-INSTRUCT also uses a two-stage optimization process: the first stage optimizes the model’s reasoning chains by penalizing state-transition errors, and the second stage then optimizes the final accuracy of the end-task plan, creating a systematic training regimen.
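The validate-and-refine loop can be sketched as follows. The real system invokes the external VAL binary; here `toy_validate` and `toy_generate` are stand-ins, and the feedback interface is an assumption for illustration, not the paper's actual API:

```python
# Sketch of a verify-and-refine loop: generate a plan, check it with an
# external validator, and re-prompt with either binary or detailed feedback.

def refine_plan(generate, validate, max_rounds=5, detailed=True):
    """Generate a plan, then repair it with validator feedback until valid."""
    plan = generate(None)
    for _ in range(max_rounds):
        ok, diagnostics = validate(plan)  # VAL-style external check
        if ok:
            return plan, True
        feedback = diagnostics if detailed else "invalid plan"
        plan = generate(feedback)         # re-prompt the model with feedback
    return plan, False

# Toy stand-ins: the "model" repairs the step the validator complains about.
def toy_validate(plan):
    if "pick-up(a)" not in plan:
        return False, "precondition failure at step 1: holding(a) not satisfied"
    return True, ""

def toy_generate(feedback):
    return ["pick-up(a)", "stack(a,b)"] if feedback else ["stack(a,b)"]

plan, ok = refine_plan(toy_generate, toy_validate)
print(ok, plan)  # → True ['pick-up(a)', 'stack(a,b)']
```

The `detailed` flag mirrors the binary-versus-detailed feedback comparison: with detailed diagnostics the model learns which precondition or effect to fix, rather than only that something is wrong.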
The system was evaluated on the PlanBench benchmark, which includes the Blocksworld, Mystery Blocksworld, and Logistics planning domains. Mystery Blocksworld is particularly challenging as it obfuscates predicate names to prevent pattern-matching; prior models reported less than 5% validity on this task without tool support. With PDDL-INSTRUCT, a Llama-3-8B model achieved up to 94% valid plans on Blocksworld. On Mystery Blocksworld, the framework produced orders-of-magnitude improvements, reported as up to 64 times better than baseline models. Substantial increases in valid plans were also recorded in the Logistics domain.
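The obfuscation idea behind Mystery Blocksworld is simply to replace meaningful predicate names with opaque tokens so that a model cannot lean on surface pattern-matching. A minimal sketch of that kind of renaming, with arbitrary placeholder tokens rather than the benchmark's actual scheme:

```python
# Illustrative predicate-name obfuscation: semantically identical PDDL facts,
# but names the model cannot pattern-match against its training data.
import re

RENAMES = {"on": "predicate-1", "clear": "predicate-2", "handempty": "predicate-3"}

def obfuscate(pddl_text: str) -> str:
    pattern = re.compile(r"\b(" + "|".join(RENAMES) + r")\b")
    return pattern.sub(lambda m: RENAMES[m.group(1)], pddl_text)

print(obfuscate("(on a b) (clear a) (handempty)"))
# → (predicate-1 a b) (predicate-2 a) (predicate-3)
```

Because the underlying transition semantics are unchanged, a planner that genuinely reasons over preconditions and effects is unaffected, while a model relying on familiar predicate names degrades sharply.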
Across all domains, the framework demonstrated up to a 66% absolute improvement in generating valid plans compared to untuned baselines. Performance was further enhanced by using detailed validator feedback and longer feedback budgets during training. This neuro-symbolic approach grounds an LLM’s reasoning in formal semantics that are checked automatically. Its current scope is limited to classical Planning Domain Definition Language (PDDL) domains and requires VAL as an external oracle. The method shows utility for agent pipelines that can accommodate a verifier, while extensions for temporal, numeric, and cost-sensitive planning remain open challenges.