Llama 3.2 Instruct 11B (Vision)

← AI Models
Meta
2024-09-25
Modality:
Intelligence
8.7
#471/523
Coding
4.3
#388/429
Math
1.7
#261/265
Speed
85 tok/s
TTFT: 406.00s
Pricing
$0.16 / $0.16
per 1M tokens (in/out)
Google Preferred Source

Llama 3.2 Instruct 11B (Vision) is Meta’s latest model designed for a variety of tasks involving visual and textual data processing. It operates at a speed of 84.541 tokens per second and is priced at $0.16 per million tokens, making it suitable for professional users seeking efficient performance.

When to Use Llama 3.2 Instruct 11B (Vision)

✓ Best For

  • Image recognition and analysis
  • Text generation based on visual inputs
  • Interactive applications requiring multimodal understanding

✗ Not Ideal For

  • Complex mathematical problem solving
  • High-speed processing in real-time applications

How Llama 3.2 Instruct 11B (Vision) Compares

Intelligence Index · Higher is better

SnowflakeAlibabaMetaGoogleIBM

Benchmark Profile

Coding Index

GoogleAllen Institute for AIMetaAmazon

Output Speed · tok/s

AlibabaNous ResearchMetaTencentReka AI

Math Index

AI21 LabsGoogleMetaIBMAllen Institute for AI

Intelligence · Coding · Math

Intelligence Coding Math

All Benchmark Scores (15)

BenchmarkScore
Intelligence Index 8.7
Coding Index 4.3
Math Index 1.7
MMLU-Pro 464%
GPQA 221%
LiveCodeBench 11%
HLE 52%
SciCode 11.2%
IFBench 30.4%
LCR 11.7%
TerminalBench Hard 0.8%
Tau2 14.6%
AIME 9.3%
AIME 2025 1.7%
MATH 500 51.6%

Data: Artificial Analysis · Updated: April 9, 2026

Frequently Asked Questions (15)

When was Llama 3.2 Instruct 11B (Vision) released?
Llama 3.2 Instruct 11B (Vision) was released on September 25, 2024.
Who created Llama 3.2 Instruct 11B (Vision)?
Llama 3.2 Instruct 11B (Vision) was created by Meta.
How intelligent is Llama 3.2 Instruct 11B (Vision)?
Llama 3.2 Instruct 11B (Vision) scores 9 on the Artificial Analysis Intelligence Index, placing it at the lower end among other open weight non-reasoning models of similar size (median: 11).
How fast is Llama 3.2 Instruct 11B (Vision)?
Llama 3.2 Instruct 11B (Vision) generates output at 51.4 tokens per second (based on the median across providers serving the model), which is at the lower end compared to other open weight non-reasoning models of similar size (median: 98.3 t/s).
What is the latency of Llama 3.2 Instruct 11B (Vision)?
Llama 3.2 Instruct 11B (Vision) has a time to first token (TTFT) of 0.76s (based on the median across providers serving the model), which is very competitive compared to other open weight non-reasoning models of similar size (median: 1.69s).
How much does Llama 3.2 Instruct 11B (Vision) cost?
Llama 3.2 Instruct 11B (Vision) costs $0.16 per 1M input tokens (somewhat higher than average, median: $0.15) and $0.16 per 1M output tokens (very competitive, median: $0.30), based on the median across providers serving the model.
What is Llama 3.2 Instruct 11B (Vision) API pricing?
Llama 3.2 Instruct 11B (Vision) costs $0.16 per 1M input tokens and $0.16 per 1M output tokens (based on the median across providers serving the model). For a blended rate (3:1 input to output ratio), this is $0.16 per 1M tokens. Pricing may vary by provider.
How verbose is Llama 3.2 Instruct 11B (Vision)?
When evaluated on the Intelligence Index, Llama 3.2 Instruct 11B (Vision) generated 5.8M output tokens, which is better than average compared to other open weight non-reasoning models of similar size (median: 8.5M).
Is Llama 3.2 Instruct 11B (Vision) a reasoning model?
No, Llama 3.2 Instruct 11B (Vision) is not a reasoning model. It provides direct responses without extended chain-of-thought reasoning.
What input modalities does Llama 3.2 Instruct 11B (Vision) support?
Llama 3.2 Instruct 11B (Vision) supports image input.
What output modalities does Llama 3.2 Instruct 11B (Vision) support?
Llama 3.2 Instruct 11B (Vision) supports text only output.
Can Llama 3.2 Instruct 11B (Vision) process images?
Yes, Llama 3.2 Instruct 11B (Vision) supports image input and can analyze, describe, and answer questions about images.
Is Llama 3.2 Instruct 11B (Vision) multimodal?
No, Llama 3.2 Instruct 11B (Vision) is not multimodal. It only supports image input.
What is the context window of Llama 3.2 Instruct 11B (Vision)?
Llama 3.2 Instruct 11B (Vision) has a context window of 130k tokens. This determines how much text and conversation history the model can process in a single request.
Is Llama 3.2 Instruct 11B (Vision) open source?
Yes, Llama 3.2 Instruct 11B (Vision) is open weights. The model weights are publicly available and can be downloaded for self-hosting.