Meta AI Llama 3.1 405B Surprisingly Beats GPT-4o

Leaked benchmarks regarding Meta AI’s Llama 3.1 405B show that this open-source LLM has a lot of potential.

Leaked: Meta AI Llama 3.1 405B benchmarks

Meta introduced Llama 3 in April 2024 as a new generation of cutting-edge, open-source large language models. The initial release included Llama 3 8B and Llama 3 70B, both of which established new performance benchmarks for LLMs in their respective sizes. However, within just three months, several other models have managed to surpass these initial benchmarks, indicating the rapid pace of advancement in the field of artificial intelligence.

Update: The 3.1 405B model is available for use at https://www.llama2.ai/. It’s important to note that the information feeding this model is up-to-date only until April 2023. Therefore, while this model offers a robust and sophisticated AI resource, its understanding and responses are based on data accumulated until that specified period. This limitation could influence the model’s effectiveness in handling queries about more recent developments or events.

Meta has announced that its most ambitious model in the Llama 3 series will boast over 400 billion parameters, a massive leap in scale that is still undergoing training. In a dramatic turn of events, early benchmark data for the forthcoming Llama 3.1 models—including the 8B, 70B, and the colossal 405B—were leaked on the LocalLLaMA subreddit today. The preliminary results suggest that the Llama 3.1 405B model could potentially surpass the performance of the current industry leader, OpenAI’s GPT-4o, across several critical AI benchmarks.

Should the Llama 3.1 405B model indeed surpass GPT-4o, it would represent the first instance of an open-source model eclipsing a leading closed-source LLM.

Benchmarks	GPT-4o	Meta Llama-3.1-405B	Meta Llama-3.1-70B	Meta Llama-3-70B	Meta Llama-3.1-8B	Meta Llama-3-8B
boolq	0.905	0.921	0.909	0.892	0.871	0.82
gsm8k	0.942	0.968	0.948	0.833	0.844	0.572
hellaswag	0.891	0.92	0.908	0.874	0.768	0.462
human_eval	0.921	0.854	0.793	0.39	0.683	0.341
mmlu_humanities	0.802	0.818	0.795	0.706	0.619	0.56
mmlu_other	0.872	0.875	0.852	0.825	0.74	0.709
mmlu_social_sciences	0.913	0.898	0.878	0.872	0.761	0.741
mmlu_stem	0.696	0.831	0.771	0.696	0.595	0.561
openbookqa	0.882	0.908	0.936	0.928	0.852	0.802
piqa	0.844	0.874	0.862	0.894	0.801	0.764
social_iqa	0.79	0.797	0.813	0.789	0.734	0.667
truthfulqa_mc1	0.825	0.8	0.769	0.52	0.606	0.327
winogrande	0.822	0.867	0.845	0.776	0.65	0.56

As you can see above, leaked benchmarks reveal that Meta’s Llama 3.1 models outshine OpenAI’s GPT-4 in a variety of tests, establishing a new standard in several crucial areas of AI performance. Notably, Llama 3.1 excels in benchmarks such as GSM8K, Hellaswag, BoolQ, MMLU-humanities, MMLU-other, MMLU-STEM, and Winograd. However, it trails behind in the HumanEval and MMLU-social sciences tests, indicating areas where further refinement is needed.

It is critical to recognize that these benchmarks reflect the performance of the base models of Llama 3.1. The true potential of these models can be realized through instruction-tuning, a process that can significantly enhance their capabilities. The forthcoming Instruct versions of the Llama 3.1 models are expected to yield even better results, showcasing improvements across various benchmarks.

Meta AI Llama 3.1 405B surprisingly beats GPT-4o — Leaked benchmarks regarding Meta AI’s Llama 3.1 405B show that this open-source LLM has a lot of potential (Image credit)

Stressing out the importance of open-source initiatives

While GPT-5 may challenge Llama 3.1’s emerging dominance, the impressive performance of Llama 3.1 against GPT-4 underscores the growing influence and capability of open-source AI initiatives.

“We are embracing the open source ethos of releasing early and often to enable the community to get access to these models while they are still in development. The text-based models we are releasing today are the first in the Llama 3 collection of models. Our goal in the near future is to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across core LLM capabilities such as reasoning and coding,” stated Meta in a blog post when launching Llama 3.

The significance of open-source AI cannot be overstated. By making their advanced models accessible to the public, Meta not only democratizes technology but also taps into the collective intelligence and diverse perspectives of the global developer community. This approach contrasts sharply with closed-source models, which are typically accessible only to a select group of users and researchers, thereby limiting the potential for widespread innovation and enhancement.

Featured image credit: Penfer/Unsplash

Tags: Benchmark gpt-4o LLama 3 meta AI

Meta AI’s Llama 3.1 405B surprisingly beats GPT-4o

Early benchmark data for the forthcoming Llama 3.1 models—including the 8B, 70B, and the colossal 405B—were leaked on the LocalLLaMA subreddit today

Related Posts

Amazon claims its new AI video summaries have “theatrical quality”

Perplexity launches free agentic shopping tool with PayPal

OpenAI says its new coding model can work for 24 hours straight

Meta drops SAM 3 and SAM 3D to turn text prompts into precise visual edits

Antigravity runs out of credits so fast developers are quitting immediately

You can now use GPT-5 and Claude together in one chaotic thread

LATEST NEWS

Amazon claims its new AI video summaries have “theatrical quality”

Google finally copies the best feature from Edge and Vivaldi

Perplexity launches free agentic shopping tool with PayPal

You should keep your Snapdragon 8 Gen 3 if you want to run emulators

Netflix grabs the Home Run Derby in fifty million dollar baseball deal