Microsoft’s ADeLe wants to give your AI a cognitive profile

Microsoft and collaborators have introduced ADeLe, a new AI evaluation framework that scores both models and tasks on 18 shared cognitive and knowledge-based scales.

by Kerem Gülen
May 14, 2025
in Research

Modern AI models are advancing at breakneck speed, but the way we evaluate them has barely kept pace. Traditional benchmarks tell us whether a model passed or failed a test but rarely offer insights into why it performed the way it did or how it might fare on unfamiliar challenges. A new research effort from Microsoft and its collaborators proposes a rigorous framework that reimagines how we evaluate AI systems.

Evaluating AI by what it needs to know

The core innovation introduced in this study is a framework called ADeLe, short for annotated-demand-levels. Instead of testing models in isolation, ADeLe scores both the model and the task on the same set of cognitive and knowledge-based scales. The result is a comprehensive profile that captures how demanding a task is and whether a specific AI system has the capabilities required to handle it.

ADeLe operates across 18 general scales, each reflecting a key aspect of cognitive or domain knowledge such as reasoning, attention, or formal subject matter expertise. Tasks are rated from 0 to 5 on each dimension, indicating how much that ability contributes to successful task completion. This dual-side annotation creates a kind of compatibility score between models and tasks, making it possible to predict outcomes and explain failures before they happen.
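To make this dual-sided annotation concrete, here is a minimal Python sketch of the idea: a task and a model are rated on the same shared scales, and the model is considered compatible when its ability meets or exceeds the task's demand on every rated dimension. The scale names and the meets-or-exceeds rule are illustrative assumptions, not ADeLe's exact rubric or scoring method.

```python
# Illustrative sketch of dual-sided annotation: tasks and models are rated
# on the same 0-5 scales. Scale names here are examples; the real framework
# uses 18 cognitive and knowledge-based scales.
from dataclasses import dataclass

@dataclass
class Profile:
    """Ratings on each scale, from 0 (not needed / absent) to 5 (very high)."""
    levels: dict[str, int]

def meets_demands(model: Profile, task: Profile) -> bool:
    """Naive compatibility check (an assumption, not ADeLe's actual rule):
    predict success when ability >= demand on every scale the task requires."""
    return all(model.levels.get(scale, 0) >= demand
               for scale, demand in task.levels.items())

task = Profile({"logical_reasoning": 4, "attention": 2})
model = Profile({"logical_reasoning": 5, "attention": 3, "metacognition": 2})
print(meets_demands(model, task))  # True: ability covers every rated demand
```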


Image: Microsoft

What sets ADeLe apart is its foundation in psychometrics—a field concerned with measuring human abilities. By adapting these human assessment tools for AI, the researchers built a framework that can be used reliably by automated systems. ADeLe was applied to 63 tasks from 20 established AI benchmarks, covering more than 16,000 examples. The researchers then used this dataset to assess 15 large language models, including industry leaders like GPT-4, LLaMA-3.1-405B, and DeepSeek-R1-Dist-Qwen-32B.

The process generated ability profiles for each model. These profiles illustrate how success rates vary with task complexity across different skills, offering a granular understanding of model capabilities. Radar charts visualize these profiles across the 18 ability dimensions, revealing nuanced patterns that raw benchmark scores alone cannot capture.
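For readers who want to reproduce this kind of visualization, the snippet below draws an ability profile as a radar chart with matplotlib. The ability values and dimension names are invented for illustration; real profiles would come from ADeLe's annotated evaluations.

```python
# Minimal radar-chart sketch of a model ability profile. Values are
# invented for illustration only.
import numpy as np
import matplotlib.pyplot as plt

abilities = {
    "reasoning": 4.1, "attention": 3.2, "metacognition": 2.8,
    "social context": 3.6, "abstraction": 3.9, "domain knowledge": 2.5,
}

labels = list(abilities)
values = list(abilities.values())
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
# Close the polygon by repeating the first point at the end.
values += values[:1]
angles += angles[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 5)  # ADeLe rates each dimension from 0 to 5
plt.show()
```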

This extensive evaluation surfaced several findings that challenge current assumptions about AI performance and progress.

  1. First, existing AI benchmarks often fail to test what they claim. For example, a benchmark designed for logical reasoning might also require niche domain knowledge or high levels of metacognition, diluting its intended focus.
  2. Second, the team uncovered distinct ability patterns in large language models. Reasoning-focused models consistently outperformed others in tasks involving logic, abstraction, and understanding social context. However, raw size alone did not guarantee superiority. Past a certain point, scaling up models produced diminishing returns in many ability areas. Training techniques and model design appeared to play a larger role in refining performance across specific cognitive domains.
  3. Third, and perhaps most significantly, ADeLe enabled accurate predictions of model success on unfamiliar tasks. By comparing task demands with model abilities, the researchers achieved prediction accuracies of up to 88 percent. This represents a substantial leap over black-box approaches that rely on embeddings or fine-tuned scores without any understanding of task difficulty or model cognition. A minimal sketch of this demand-ability matching idea follows the image below.
Image: Microsoft
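To illustrate one way such a predictor could work, here is a hedged Python sketch: each (model, task) pair is featurized as per-scale margins (ability minus demand), and a logistic classifier is fit on pass/fail outcomes. The feature choice, the classifier, and the synthetic data are all assumptions for illustration; the paper's actual prediction method is not reproduced here.

```python
# One plausible instantiation of demand-ability matching as a predictor:
# featurize each (model, task) pair as per-scale margins and fit a
# logistic classifier on observed pass/fail outcomes. Hedged sketch only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, n_scales = 500, 18

abilities = rng.uniform(0, 5, size=(n_pairs, n_scales))  # model side
demands = rng.uniform(0, 5, size=(n_pairs, n_scales))    # task side
margins = abilities - demands                            # fit features

# Synthetic ground truth: success is likelier when margins are positive.
p_success = 1 / (1 + np.exp(-margins.mean(axis=1) * 3))
y = rng.random(n_pairs) < p_success

clf = LogisticRegression().fit(margins, y)
print(f"train accuracy: {clf.score(margins, y):.2f}")
```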

Using the ability-demand matching approach, the team developed a system capable of forecasting AI behavior across a wide range of scenarios. Whether applied to new benchmarks or real-world challenges, this system provides a structured and interpretable method for anticipating failures and identifying suitable models for specific use cases. This predictive capability is particularly relevant in high-stakes environments where reliability and accountability are non-negotiable.

Rather than deploying AI based on general reputation or limited task scores, developers and decision-makers can now use demand-level evaluations to match systems to tasks with far greater confidence. This supports not only more reliable implementation but also better governance, as stakeholders can trace model behavior back to measurable abilities and limitations.


The implications of ADeLe extend beyond research labs. This evaluation method offers a foundation for standardized, interpretable assessments that can support everything from AI research and product development to regulatory oversight and public trust. As general-purpose AI becomes embedded in sectors like education, healthcare, and law, understanding how models will behave outside of their training context becomes not just useful but essential.

ADeLe’s modular design allows it to be adapted to multimodal and embodied systems, further expanding its relevance. It aligns with Microsoft’s broader position on the importance of psychometrics in AI and echoes calls in recent white papers for more transparent, transferable, and trustworthy AI evaluation tools.

Toward smarter evaluation standards

For all the optimism around foundation models, one of the looming risks has been the lack of meaningful evaluation practices. Benchmarks have driven progress, but they have also limited our visibility into what models actually understand or how they might behave in unexpected situations. With ADeLe, we now have a path toward changing that.

This work reframes evaluation not as a checklist of scores but as a dynamic interaction between systems and tasks. By treating performance as a function of demand-ability fit, it lays the groundwork for a more scientific, reliable, and nuanced understanding of AI capabilities. That foundation is critical not only for technical progress but also for responsible adoption of AI in complex human contexts.



Tags: AI, Featured, Microsoft
