The unveiling of OpenAI's ChatGPT, built on some of the best large language models around, has provoked a competitive surge in the AI field. A diverse cast of participants, from imposing corporate giants to ambitious startups to the altruistic open-source community, is deeply engrossed in the exciting endeavor to build the most advanced large language models.
In the bustling realm of technology in 2023, it’s an inescapable truth: one cannot neglect the revolutionary influence of trending phenomena such as Generative AI and the mighty large language models (LLMs) that fuel the intellect of AI chatbots.
In a whirlwind of such competition, there have already been a plethora of LLMs unveiled – hundreds, in fact. Amid this dizzying array, the key question persists: which models truly stand out as the most proficient? Which are worthy of being crowned among the best large language models? To offer some clarity, we embark on a revealing journey through the finest proprietary and open-source large language models in 2023.
Best large language models (LLMs)
Now, we delve into an eclectic collection of some of the best large language models that are leading the charge in 2023. Rather than offering a strict ranking from the best to the least effective, we present an unbiased compilation of LLMs, each uniquely tailored to serve distinct purposes. This list celebrates the diversity and broad range of capabilities housed within the domain of large language models, opening a window into the intricate world of AI.
GPT-4
The vanguard of AI large language models in 2023 is, without a doubt, OpenAI's GPT-4. Unveiled in March of that year, this model has demonstrated astonishing capabilities: deep comprehension of complex reasoning, advanced coding abilities, strong results across a multitude of academic evaluations, and many other competencies that echo human-level performance. Remarkably, GPT-4 is one of the first flagship models to incorporate multimodal capability, accepting both text and image inputs. Although ChatGPT hasn't yet inherited this multimodal ability, some fortunate users have experienced it via Bing Chat, which leverages the power of the GPT-4 model.
GPT-4 has substantially addressed and improved upon the issue of hallucination, a considerable leap in maintaining factuality. When pitted against its predecessor, GPT-3.5, the GPT-4 model achieves a score nearing 80% in factual evaluations across numerous categories. OpenAI has invested significant effort to align the GPT-4 model more closely with human values, employing Reinforcement Learning from Human Feedback (RLHF) and domain-expert adversarial testing.
This titan reportedly packs over 1 trillion parameters and supports a maximum context length of 32,768 tokens. The internal architecture of GPT-4, officially still undisclosed, was described by George Hotz of the Tiny Corp: according to his account, GPT-4 is a blend of eight distinct models, each comprising 220 billion parameters. If true, that deviates from the traditional single, dense model we initially believed it to be.
GPT-4 can be used through ChatGPT plugins, for web browsing via Bing, or directly through the API. Despite a few drawbacks, such as slower responses and higher inference cost, which lead some developers to opt for the GPT-3.5 model, GPT-4 stands unchallenged as the best large language model available in 2023. For serious applications, it's highly recommended to subscribe to ChatGPT Plus, available for $20 per month. Alternatively, for those preferring not to pay, third-party portals offer access to GPT-4 for free.
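For developers heading to the API, a minimal sketch of what a GPT-4 request body looks like under OpenAI's 2023-era Chat Completions endpoint is shown below. The endpoint URL and field names match the public documentation; the system message, prompt, and temperature are placeholder choices, and the actual POST (with an `Authorization: Bearer <key>` header) is left to whichever HTTP client you prefer:

```python
import json

# Documented Chat Completions endpoint (2023 API shape).
API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(prompt: str, model: str = "gpt-4") -> dict:
    """Assemble the JSON body for a chat completion call."""
    return {
        "model": model,
        "messages": [
            # A system message steers behavior; the user message carries the prompt.
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,  # illustrative value, not a recommendation
    }

body = build_request("Summarize the transformer architecture in one sentence.")
print(json.dumps(body, indent=2))  # inspect the payload before POSTing it
```

The same body works for `gpt-3.5-turbo` by swapping the `model` field, which is how many developers trade quality for the faster, cheaper model mentioned above.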
GPT-3.5
Hot on the heels of GPT-4, OpenAI holds its ground with the GPT-3.5 model, taking a respectable second place. GPT-3.5 is a general-purpose LLM, akin to GPT-4, albeit lacking in specialized domain expertise. Its key advantage lies in its remarkable speed; it formulates complete responses within mere seconds.
From creative tasks like crafting essays with ChatGPT to devising business plans, GPT-3.5 performs admirably. OpenAI has also extended the context length to a generous 16K for the GPT-3.5-turbo model. Adding to its appeal, it’s free to use without any hourly or daily restrictions.
However, GPT-3.5 does exhibit some shortcomings. Its tendency to hallucinate results in the frequent propagation of incorrect information, making it less suitable for serious research work. Despite this, for basic coding queries, translation, comprehension of scientific concepts, and creative endeavors, GPT-3.5 holds its own.
GPT-3.5's performance on the HumanEval benchmark yielded a score of 48.1%, while its more advanced sibling, GPT-4, secured a higher score of 67%. This distinction partly stems from scale: GPT-3.5 has 175 billion parameters, while GPT-4 reportedly has over 1 trillion.
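For context on those percentages: HumanEval results are reported as pass@k, the probability that at least one of k sampled code completions passes the problem's unit tests (the scores above are pass@1). The unbiased estimator introduced with OpenAI's Codex work can be computed as a one-liner; the sample counts below are made-up numbers purely to exercise the function:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 completions sampled, 5 pass the tests.
print(pass_at_k(10, 5, 1))  # pass@1 = 0.5
```

Averaging this quantity over all 164 HumanEval problems gives the headline percentage a model reports.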
PaLM 2 (Bison-001)
Carving its own niche among the best large language models of 2023, we find Google's PaLM 2. Google has enriched this model by concentrating on aspects such as commonsense reasoning, formal logic, mathematics, and advanced coding across a diverse set of over 20 languages. The most expansive iteration of PaLM 2 reportedly has 540 billion parameters and a maximum context length of 4,096 tokens.
Google has introduced a quartet of models based on the PaLM 2 framework, in varying sizes (Gecko, Otter, Bison, and Unicorn). Currently, Bison is the available offering. In the MT-Bench test, Bison secured a score of 6.40, somewhat overshadowed by GPT-4’s impressive 8.99 points. However, in reasoning evaluations, such as WinoGrande, StrategyQA, XCOPA, and similar tests, PaLM 2 exhibits a stellar performance, even surpassing GPT-4. Its multilingual capabilities enable it to understand idioms, riddles, and nuanced texts from various languages – a feat other LLMs find challenging.
PaLM 2 also offers the advantage of quick responses, providing three at a time. Users can test the PaLM 2 (Bison-001) model on Google’s Vertex AI platform, as detailed in our article. For consumer usage, Google Bard, powered by PaLM 2, is the way to go.
Codex
OpenAI Codex, an offspring of GPT-3, shines in the realms of programming, writing, and data analysis. Launched in conjunction with GitHub for GitHub Copilot, Codex displays proficiency in over a dozen programming languages. This model can interpret straightforward commands in natural language and execute them, paving the way for natural language interfaces for existing applications. Codex shows exceptional aptitude in Python, extending its capabilities to languages such as JavaScript, Go, Perl, PHP, Ruby, Swift, TypeScript, and Shell. With an expanded memory of 14KB for Python code, Codex vastly outperforms GPT-3 by factoring in over three times the contextual information during task execution.
Text-ada-001
Ada, served through the API as Text-ada-001, is the fastest and most affordable model in the GPT-3 series, crafted for simpler tasks and sitting at the less complex end of the capability spectrum. Other models, like Curie (text-curie-001) and Babbage (text-babbage-001), provide intermediate capabilities. Variants of the Ada family, such as Text-similarity-ada-001, Text-search-ada-doc-001, and Code-search-ada-text-001, each carry unique strengths and limitations concerning quality, speed, and availability. Text-ada-001 itself is well suited for tasks like text parsing, address correction, and simple classification.
Claude v1
Emerging from the stables of Anthropic, a company receiving support from Google and co-founded by former OpenAI employees, is Claude – an impressive contender among the best large language models of 2023. The company’s mission is to create AI assistants that embody helpfulness, honesty, and harmlessness. Anthropic’s Claude v1 and Claude Instant models have shown tremendous potential in various benchmark tests, even outperforming PaLM 2 in the MMLU and MT-Bench examinations.
Claude v1 delivers an impressive performance, not far from GPT-4, scoring 7.94 in the MT-Bench test (compared to GPT-4's 8.99). It secures 75.6 points in the MMLU benchmark, slightly behind GPT-4's 86.4. Anthropic made a pioneering move by offering a 100k-token context window, the largest available at the time, in its Claude-Instant-100k model. This allows users to load close to 75,000 words into a single window, a feat that is truly mind-boggling. Interested readers can learn how to use Anthropic's Claude via our detailed tutorial.
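The 75,000-word figure follows from the common rule of thumb that one token corresponds to roughly 0.75 English words. That ratio is an approximation for typical English text, not an exact property of Claude's tokenizer, but the back-of-the-envelope arithmetic is simple:

```python
TOKENS = 100_000          # Claude-Instant-100k context window
WORDS_PER_TOKEN = 0.75    # rough average for English prose (approximation)

approx_words = int(TOKENS * WORDS_PER_TOKEN)
print(approx_words)  # → 75000
```

Code, non-English text, and unusual formatting tokenize less efficiently, so the practical word count can be noticeably lower.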
Text-babbage-001
Best suited for moderate classification and semantic-search classification tasks, Text-babbage-001 is a GPT-3 language model known for its nimble response time and lower cost compared to other models in the series.
Cohere
Founded by former Google Brain team members, including Aidan Gomez, a co-author of the influential "Attention Is All You Need" paper that introduced the Transformer architecture, Cohere is an AI startup targeting enterprise customers. Unlike other AI companies, Cohere focuses on solving generative AI use cases for corporations. Its range of models varies from small ones with just 6B parameters to large models with 52B parameters.
The recent Cohere Command model is gaining acclaim for its accuracy and robustness. According to Stanford HELM, the Cohere Command model holds the highest accuracy score among its peers. Corporations like Spotify, Jasper, and HyperWrite employ Cohere’s model to deliver their AI experience.
In terms of pricing, Cohere charges $15 to generate 1 million tokens, while OpenAI’s turbo model charges $4 for the same quantity. However, Cohere offers superior accuracy compared to other LLMs. Therefore, if you are a business seeking the best large language model to integrate into your product, Cohere’s models deserve your attention.
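Using the per-million-token prices quoted above (which reflect mid-2023 reporting and change frequently), the cost gap for a given workload is easy to estimate. The 50-million-token monthly volume below is a hypothetical workload chosen for illustration:

```python
COHERE_PER_M = 15.00        # USD per 1M generated tokens, as quoted above
OPENAI_TURBO_PER_M = 4.00   # USD per 1M tokens for the turbo model, as quoted above

def monthly_cost(tokens_per_month: int, price_per_million: float) -> float:
    """Linear cost estimate from token volume and per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million

# Hypothetical workload: 50M generated tokens per month.
print(monthly_cost(50_000_000, COHERE_PER_M))        # → 750.0
print(monthly_cost(50_000_000, OPENAI_TURBO_PER_M))  # → 200.0
```

Whether the accuracy premium justifies the price gap depends entirely on how costly an incorrect generation is in your product.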
Text-curie-001
Best suited for tasks like language translation, complex classification, text sentiment analysis, and summarization, Text-curie-001 is a competent language model that falls under the GPT-3 series. Introduced in June 2020, this model excels in speed and cost-effectiveness compared to Davinci. With 6.7 billion parameters, Text-curie-001 is built for efficiency while maintaining a robust set of capabilities. It stands out in various natural language processing tasks and serves as a versatile choice for processing text-based data.
Text-davinci-003
Designed for tasks such as complex intent recognition, cause-and-effect understanding, and audience-specific summarization, Text-davinci-003 is a language model with capabilities similar to text-davinci-002, but trained with reinforcement learning from human feedback rather than supervised fine-tuning alone. As a result, it surpasses the curie, babbage, and ada models in quality, output length, and consistent adherence to instructions. It also offers extra features, such as the ability to insert text.
Alpaca-7b
Primarily useful for conversing, writing and analyzing code, generating text and content, and querying specific information, Stanford's Alpaca, a fine-tune of Meta's LLaMA model, aims to overcome the limitations of ChatGPT by facilitating custom AI chatbots that run locally and remain consistently available offline. It empowers users to build AI chatbots tailored to their individual requirements, free from dependence on external servers or connectivity concerns.
Alpaca exhibits behavior similar to text-davinci-003 while being smaller, more cost-effective, and easy to replicate. Its training recipe combines a strong pre-trained language model with high-quality instruction data generated using OpenAI's text-davinci-003. Although the model is released for academic research purposes only, Stanford stresses the need for further evaluation and for reporting any troubling behaviors.
StableLM-Tuned-Alpha-7B
Ideal for conversational tasks like chatbots, question-answering systems, and dialogue generation, StableLM-Tuned-Alpha-7B is a decoder-only language model with 7 billion parameters. It builds upon the StableLM-Base-Alpha models, which were trained on a new dataset derived from The Pile containing approximately 1.5 trillion tokens, and is fine-tuned further on chat and instruction-following datasets from multiple AI research entities.
30B-Lazarus
The 30B-Lazarus model by CalderaAI, based on the LLaMA model, was assembled from LoRA fine-tunes drawn from a diverse array of models. As a result, it performs exceptionally well on many LLM benchmarks. If your use case primarily involves text generation rather than conversational chat, 30B-Lazarus may be a sound choice.
Open-Assistant SFT-4 12B
Intended for functioning as an assistant, responding to user queries with helpful answers, the Open-Assistant SFT-4 12B is the fourth iteration of the Open-Assistant project. Derived from a Pythia 12B model, it has been fine-tuned on human demonstrations of assistant conversations collected through an application. This open-source chatbot, an alternative to ChatGPT, is now accessible free of charge.
WizardLM
Built to follow complex instructions, WizardLM is a promising open-source large language model. Its developers, a team of AI researchers, used the Evol-Instruct approach, in which an initial set of instructions is automatically rewritten into progressively more complex ones; the resulting instruction data is then used to fine-tune the LLaMA model.
FLAN-UL2
Created to provide a reliable and scalable method for pre-training models that excel across a variety of tasks and datasets, FLAN-UL2 is an encoder-decoder model grounded on the T5 architecture. This model, a fine-tuned version of the UL2 model, shows significant improvements. It has an extended receptive field of 2048, simplifying inference and fine-tuning processes, making it more suited for few-shot in-context learning. The FLAN datasets and methods have been open-sourced, promoting effective instruction tuning.
GPT-NeoX-20b
Best used for a vast array of natural language processing tasks, GPT-NeoX-20B is a dense autoregressive language model with 20 billion parameters. Trained on the Pile dataset, it was at its release the largest autoregressive model with publicly accessible weights. Competitive in language understanding, mathematics, and knowledge-based tasks, GPT-NeoX-20B uses a different tokenizer than GPT-J-6B and GPT-Neo. Its enhanced suitability for tasks like code generation stems from the allocation of extra tokens to whitespace characters.
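The whitespace point can be made concrete with a toy comparison: a scheme that emits one token per space pays heavily for indented code, whereas a scheme with tokens for whole runs of spaces does not. The two regex-based "tokenizers" below are simplified illustrations of that trade-off, not the actual GPT-NeoX or GPT-2 tokenizers:

```python
import re

code = "def f(x):\n        return x * 2"  # 8-space indentation

# Toy scheme A: every single space is its own token.
tokens_a = re.findall(r" |\S+|\n", code)

# Toy scheme B: any run of spaces collapses into one whitespace token.
tokens_b = re.findall(r" +|\S+|\n", code)

print(len(tokens_a), len(tokens_b))  # → 19 12
```

Real code is mostly indentation at scale, so a tokenizer that compresses whitespace runs fits substantially more source code into the same context window.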
BLOOM
Optimized for text generation and exploring characteristics of language generated by a language model, BLOOM is the BigScience Large Open-science Open-access Multilingual language model, a project funded in part by the French government. This autoregressive model can generate coherent text in 46 natural languages and 13 programming languages and can perform text tasks it wasn't explicitly trained for. Despite its potential risks and limitations, BLOOM opens avenues for public research on large language models and can be utilized by a diverse range of users including researchers, students, educators, engineers/developers, and non-commercial entities.
BLOOMZ
Ideal for performing tasks expressed in natural language, BLOOMZ and mT0 are BigScience-developed models that can follow human instructions in multiple languages without prior training. These models, fine-tuned on a cross-lingual task mixture known as xP3, can generalize across different tasks and languages. However, performance may vary depending on the prompt provided. To ensure accurate results, it's advised to clearly indicate the end of the input and to provide sufficient context. These measures can significantly improve the models' accuracy and effectiveness in generating appropriate responses to user instructions.
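That prompt-formatting advice can be sketched as a small helper: end the prompt with an explicit cue so the model knows the input is complete and a response is expected. The template wording below is illustrative, not a format BLOOMZ requires:

```python
def build_prompt(instruction: str, text: str) -> str:
    """Wrap an instruction and its input, ending with an explicit answer cue."""
    # The trailing "Answer:" marks the end of the input so the model
    # continues with a response instead of extending the input text.
    return f"{instruction}\n\nInput: {text}\n\nAnswer:"

p = build_prompt("Translate to French:", "I love programming.")
print(p)
```

With instruction-tuned models like BLOOMZ, even a trailing period versus an open-ended sentence can change whether the model answers or keeps completing the input.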
FLAN-T5-XXL
Best utilized for advancing research on language models, FLAN-T5-XXL is a powerful tool in the field of zero-shot and few-shot learning, reasoning, and question-answering. This language model surpasses T5 by being fine-tuned on over 1000 additional tasks and encompassing more languages. It’s dedicated to promoting fairness and safety research, as well as mitigating the limitations of current large language models. However, potential harmful usage of language models like FLAN-T5-XXL necessitates careful safety and fairness evaluations before application.
Command-medium-nightly
Ideal for developers who require rapid response times, such as those building chatbots, Cohere’s Command-medium-nightly is the regularly updated version of the command model. These nightly versions assure continuous performance enhancements and optimizations, making them a valuable tool for developers.
Falcon
Falcon, developed by the UAE's Technology Innovation Institute (TII) and open-sourced under an Apache 2.0 license, is available for commercial use without any royalties or restrictions. The Falcon-40B-Instruct model, fine-tuned on instruction and chat data, is particularly useful for chat applications.
Gopher – DeepMind
DeepMind's Gopher is a 280-billion-parameter model exhibiting extraordinary language understanding and generation capabilities. Gopher excels in various fields, including math, science, technology, humanities, and medicine, and is adept at simplifying complex subjects during dialogue-based interactions. It's a valuable tool for reading comprehension, fact-checking, and understanding toxic language and logical/common-sense tasks.
Vicuna 33B
Vicuna 33B, derived from LLaMA and fine-tuned using supervised instruction, is ideal for chatbot development, research, and hobby use. This auto-regressive large language model has 33 billion parameters and was fine-tuned on user-shared conversations collected from sharegpt.com.
Jurassic-2
The Jurassic-2 family from AI21 Labs, including the Large, Grande, and Jumbo base language models, excels at reading- and writing-related use cases. With the introduction of zero-shot instruction capabilities, the Jurassic-2 models can be guided with natural language without the use of examples. They have demonstrated promising results on Stanford's Holistic Evaluation of Language Models (HELM), the leading benchmark for language models.
LLM cosmos and wordsmith bots
In the rich tapestry of the artificial intelligence and natural language processing world, Large Language Models (LLMs) emerge as vibrant threads weaving an intricate pattern of advancements. The number of these models is not static; it’s an ever-expanding cosmos with new stars born daily, each embodying their unique properties and distinctive functionalities.
Each LLM acts as a prism, refracting the raw light of data into a spectrum of insightful information. They boast specific abilities, designed and honed for niche applications. Whether it’s the intricate art of decoding labyrinthine instructions, scouring vast data galaxies to extract relevant patterns, or translating the cryptic languages of code into human-readable narratives, each model holds a unique key to unlock these capabilities.
Not all models are created equal. Some are swift as hares, designed to offer rapid response times, meeting the demands of real-time applications, such as the vibrant, chatty world of chatbot development. Others are more like patient, meticulous scholars, dedicated to unraveling complex topics into digestible knowledge nuggets, aiding the pursuit of academic research or providing intuitive explanations for complex theories.
All images in this post, including the featured image, were created by Kerem Gülen using Midjourney