Researchers at Alibaba have developed SkillWeaver, a framework for routing subtasks to the right tools in enterprise AI systems. SkillWeaver builds an execution graph for each task and selects an appropriate skill for each node. The framework incorporates Skill-Aware Decomposition (SAD), a technique that uses a feedback loop to select tools iteratively, which distinguishes it from frameworks that choose tools in a single pass.
SkillWeaver is designed for real-world AI applications, such as orchestrating multiple tools through the Model Context Protocol (MCP) for business operations including data handling and reporting. Tests show that SkillWeaver’s approach increases accuracy while reducing token consumption by over 99% compared with exposing agents to an entire tool library.
The central challenge is the granularity of task decomposition: practical queries are often compositional requests that require multiple skills. Skills are defined as modular, reusable specifications documented in structured natural language. Current AI frameworks often treat tool routing as a single-skill selection task, which is insufficient for complex workflows.
SkillWeaver’s operation consists of three stages: Decompose, Retrieve, and Compose. In the Decompose stage, an LLM breaks down complex user queries into manageable subtasks. Next, the Retrieve stage employs an embedding model to identify candidate tools for each subtask from a skill library. Finally, the Compose stage assesses the compatibility of these tools and formulates a Directed Acyclic Graph (DAG) that outlines the execution plan.
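The three stages can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the skill names, the stubbed `decompose` function (a real system would call an LLM), and the bag-of-words `embed` function, which stands in for a real embedding model. The sketch also skips the compatibility check the Compose stage is described as performing; it only orders subtasks and attaches the top-ranked skill to each node.

```python
from collections import Counter
from graphlib import TopologicalSorter
import math

# Hypothetical skill library: skill name -> natural-language description.
SKILLS = {
    "sql_query": "run a sql query against the sales database",
    "chart_builder": "build a chart from tabular data",
    "email_sender": "send an email report to a recipient",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def decompose(query):
    """Stage 1 (stubbed): a real system would ask an LLM for these subtasks."""
    return [
        {"id": "t1", "text": "run a sql query for quarterly sales", "after": []},
        {"id": "t2", "text": "build a chart from the data", "after": ["t1"]},
        {"id": "t3", "text": "send an email with the report", "after": ["t2"]},
    ]

def retrieve(subtask_text, k=1):
    """Stage 2: rank skills by similarity to the subtask text, keep the top k."""
    q = embed(subtask_text)
    ranked = sorted(SKILLS, key=lambda name: cosine(q, embed(SKILLS[name])), reverse=True)
    return ranked[:k]

def compose(subtasks):
    """Stage 3: order subtasks as a DAG and attach the top skill to each node."""
    graph = {t["id"]: set(t["after"]) for t in subtasks}
    order = list(TopologicalSorter(graph).static_order())
    by_id = {t["id"]: t for t in subtasks}
    return [(tid, retrieve(by_id[tid]["text"])[0]) for tid in order]

plan = compose(decompose("Email me a chart of quarterly sales"))
for node, skill in plan:
    print(node, "->", skill)
```

Using `graphlib.TopologicalSorter` guarantees that a subtask is never scheduled before the subtasks it depends on, and it raises `CycleError` if the decomposition is not actually acyclic.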
SkillWeaver also tackles the problem of LLMs generating generic subtask descriptions, using the SAD feedback loop. The LLM drafts an initial plan, retrieves matching skills, and then refines its decomposition based on the retrieved tools, which keeps the subtasks aligned with the specific technical vocabulary of the tools.
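A rough sketch of such a draft, retrieve, refine loop is shown below. It is not the researchers' implementation: the two LLM calls are replaced by hypothetical stubs (`draft_plan` and `refine`), the retriever is a toy word-overlap scorer, and the `rounds` and `min_overlap` parameters are invented for the example.

```python
# Hypothetical skill library with its own technical vocabulary.
SKILLS = {
    "pivot_table": "aggregate rows into a pivot table",
    "pdf_export": "export a document as pdf",
}

def retrieve(text):
    """Toy retriever: return the skill sharing the most words with the subtask."""
    words = set(text.lower().split())
    scores = {name: len(words & set(desc.split())) for name, desc in SKILLS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

def draft_plan(query):
    """Stubbed LLM call: first-pass subtasks written in generic wording."""
    return ["aggregate the numbers", "export the result"]

def refine(subtask, skill_name):
    """Stubbed LLM call: rewrite the subtask in the retrieved skill's vocabulary.

    Here the stub simply adopts the skill description; a real LLM would rephrase.
    """
    return SKILLS[skill_name]

def skill_aware_decompose(query, rounds=2, min_overlap=3):
    """Draft a plan, then refine weakly matched subtasks using retrieved skills."""
    plan = draft_plan(query)
    for _ in range(rounds):
        changed = False
        for i, subtask in enumerate(plan):
            skill, overlap = retrieve(subtask)
            if overlap < min_overlap:  # weak match: wording is too generic
                plan[i] = refine(subtask, skill)
                changed = True
        if not changed:  # every subtask already matches a skill well
            break
    return plan

final_plan = skill_aware_decompose("Summarize sales and save as PDF")
print(final_plan)
```

The point of the loop is that the second retrieval pass sees subtasks phrased in the library's own terms, so the match is much stronger than for the generic first draft.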
To evaluate effectiveness, the researchers created CompSkillBench, a benchmark featuring 300 multi-step queries based on 2,209 real-world skills. The core engine used a 7-billion-parameter model (Qwen2.5-7B-Instruct) for decomposition, paired with a semantic search retriever. Testing showed that the SAD feedback loop raised decomposition accuracy from 51.0% to 67.7%, with larger models reaching 92% accuracy.
The results also showed that larger models can perform worse when given less guidance. A vanilla setup using a larger model scored below the smaller model because it broke tasks into unnecessary subtasks. The research demonstrated that proper alignment with tool vocabulary often matters more than simply using a larger model.
Token savings were substantial: SkillWeaver reduced context window consumption from approximately 884,000 tokens to about 1,160 tokens per query, which lowers API costs and speeds up responses. Among the baselines, the LLM-Direct method managed only a 21.1% accuracy rate in tool retrieval, while ReAct-style agents achieved 0% accuracy.
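The "over 99%" figure follows directly from the two approximate token counts quoted above, as this short calculation shows. The variable names are chosen for the example.

```python
full_library_tokens = 884_000  # approximate per-query figure quoted above
skillweaver_tokens = 1_160     # approximate per-query figure quoted above

# Fraction of context tokens saved, and how many times smaller the prompt is.
reduction = 1 - skillweaver_tokens / full_library_tokens
ratio = full_library_tokens / skillweaver_tokens

print(f"reduction: {reduction:.2%}")  # reduction: 99.87%
print(f"ratio: {ratio:.0f}x")         # ratio: 762x
```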
Although the source code for SkillWeaver has not been released, the researchers have provided prompt templates that developers can implement using existing libraries like LangChain and LlamaIndex. Setup requires vectorizing the tool library once and building a FAISS index, a step that completes quickly and keeps latency low during retrieval.
A limitation of SkillWeaver is its lack of error recovery in multi-step tool chains. The study indicated that a failure in one step compromises the entire chain, which points to a need for better error handling in the framework.
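The failure mode can be illustrated with a small sketch. It is an assumption-laden toy, not SkillWeaver's executor: the step names, the `run_plan` helper, and the status labels are all hypothetical. It shows only how a single failed step leaves every downstream step unable to run when no recovery logic exists.

```python
from graphlib import TopologicalSorter

def run_plan(deps, actions):
    """Run steps in dependency order; skip any step whose prerequisite did not succeed."""
    status = {}
    for step in TopologicalSorter(deps).static_order():
        if any(status[d] != "ok" for d in deps.get(step, ())):
            status[step] = "skipped"  # an upstream failure blocks this step
            continue
        try:
            actions[step]()
            status[step] = "ok"
        except Exception:
            status[step] = "failed"
    return status

def fail():
    """Hypothetical tool call that always errors."""
    raise RuntimeError("tool call failed")

# Hypothetical three-step chain: query -> chart -> email.
deps = {"query": set(), "chart": {"query"}, "email": {"chart"}}
actions = {"query": lambda: None, "chart": fail, "email": lambda: None}

status = run_plan(deps, actions)
print(status)
```

Here the `chart` step fails, so `email` never runs even though it would have succeeded on its own. Retries or fallback skills would have to be added at the point where the exception is caught.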





