Llama 4 Maverick

400B parameters, 128 experts - Meta's most capable open model

Llama 4 Maverick is the first open-weight model to consistently outperform GPT-4o across multiple benchmark categories. With 400B total parameters routed through 128 specialized experts and only 17B active per token, it delivers frontier-class reasoning, coding, and multimodal understanding without the cost of a proprietary API. Whether you need advanced code generation, scientific analysis, or image understanding, Llama 4 Maverick brings the quality of closed-source leaders to an open-weight package you can deploy anywhere.

Start Chatting View benchmarks

Model variants

Instruction-tuned and base models

Choose between the instruction-tuned variant optimized for chat and complex tasks, or the base model for fine-tuning and research.

128-Expert MoE Architecture

400B total parameters, 17B active per token

Maverick scales to 128 experts from Scout's 16, packing 400B total parameters while keeping the same 17B active footprint per token. This gives it significantly stronger reasoning, coding, and multimodal capabilities.

The default chat model on this site. Best for tasks requiring maximum quality: complex reasoning, code generation, multimodal analysis, and research synthesis.

Start Chatting See capabilities

Instruction-tuned

Maverick Instruct

Optimized for conversational AI, complex reasoning, and code generation

Fine-tuned with RLHF for following instructions and multi-turn dialogue

Available now

Start Chatting Download weights

Pre-trained

Maverick Base

Foundation MoE model for fine-tuning and specialized applications

Pre-trained on diverse multimodal data with 128-expert routing

Available now

View on HuggingFace Documentation

Capabilities

Frontier performance from Llama 4 Maverick

Llama 4 Maverick combines 128-expert MoE efficiency with advanced reasoning, strong coding, and native multimodal understanding. Every capability is tuned for maximum quality at 17B active parameters per token, making it a practical alternative to proprietary frontier models.

128-expert MoE

Routes each token through specialized experts from a pool of 128. The 400B total parameters deliver frontier quality at only 17B inference cost per token. This architecture means each expert can develop deep specialization in specific domains, from mathematics to creative writing, resulting in consistently high quality across diverse tasks.

Advanced reasoning

Strong performance on MMLU Pro (80.5%) and GPQA Diamond (69.8%) demonstrates deep knowledge and scientific reasoning. Llama 4 Maverick handles multi-step logic, mathematical proofs, and complex analytical tasks with accuracy that matches or exceeds proprietary alternatives. The 128-expert architecture allows different experts to contribute specialized knowledge at each reasoning step.

Code generation

Outperforms GPT-4o on coding benchmarks including LiveCodeBench v5. Llama 4 Maverick generates production-ready code across dozens of programming languages, debugs complex issues, and explains algorithmic approaches clearly. Native function calling enables agentic workflows where the model can autonomously execute code, call APIs, and chain tool operations.

1M token context

Process long documents, codebases, and extended conversations within a 1 million token context window. While Scout offers 10M tokens for extreme long-context tasks, the 1M window in Llama 4 Maverick is sufficient for most production use cases including full project analysis, lengthy research papers, and multi-turn conversations that span hundreds of exchanges.

Native multimodal

Early fusion architecture processes text and images together natively from the ground up. Analyze screenshots, diagrams, charts, technical drawings, and documents alongside text without separate vision pipelines. Llama 4 Maverick scores 73.4% on MMMU, demonstrating strong visual reasoning that rivals dedicated vision models.

Multilingual

Strong performance across multiple languages makes Llama 4 Maverick suitable for global applications. The model handles translation, cross-lingual reasoning, and culturally nuanced content generation with consistent quality. Whether your users communicate in English, Chinese, Spanish, French, or other supported languages, the output quality remains high.

Key highlights

Why Llama 4 Maverick stands out

Llama 4 Maverick is the first open-weight model to consistently beat GPT-4o across multiple benchmark categories.

Benchmark highlights

MMLU Pro 80.5% - competitive with frontier proprietary models
GPQA Diamond 69.8% - strong scientific reasoning
MMMU 73.4% - excellent multimodal understanding
Outperforms GPT-4o on coding benchmarks
Arena ELO competitive with top-tier models

Technical specs

400B total parameters, 17B active per token
128 experts in MoE architecture
1M token context window
Native multimodal (text + image)
Llama 3.1 compatible license

Start Free Chat Download weights

Performance

Frontier quality from Llama 4 Maverick

Llama 4 Maverick achieves 80.5% on MMLU Pro and 73.4% on MMMU, outperforming GPT-4o on multiple benchmarks while activating only 17B parameters per token.

The benchmark results tell a compelling story, but real-world performance is where Llama 4 Maverick truly proves itself. Developers report that code generation quality rivals the best proprietary models, with fewer hallucinations and more accurate function implementations. Researchers find that scientific reasoning tasks produce well-structured, citation-aware responses. The 128-expert architecture means the model can draw on deeply specialized knowledge for each subtask, resulting in outputs that feel like they come from a domain expert rather than a generalist.

Start Chatting View model card

Llama 4 Maverick performance comparison chart

MMLU Pro 80.5% - frontier-class knowledge and reasoning

GPQA Diamond 69.8% - strong scientific reasoning

MMMU 73.4% - excellent multimodal understanding

Outperforms GPT-4o on coding benchmarks

17B active parameters from 400B total (128 experts)

Benchmark comparison

Maverick vs Scout and previous generation

Maverick's 128-expert architecture delivers significant improvements over Scout and Llama 3.1 across all categories.

Benchmark	Llama 4 Maverick 128 experts Featured	Llama 4 Scout 16 experts	Llama 3.1 70B Dense	GPT-4o Proprietary
MMLU Pro Knowledge & reasoning	80.5%	74.3%	66.4%	78.4%
GPQA Diamond Scientific knowledge	69.8%	57.2%	46.7%	53.6%
LiveCodeBench v5 Coding	43.4%	32.8%	28.5%	37.0%
MMMU Multimodal	73.4%	69.4%	-	69.1%
Context Window Max tokens	1M	10M	128K	128K
Total Parameters Model size	400B	109B	70B	-
Active Parameters Per token	17B	17B	70B	-

Data from Meta's official model card and independent evaluations.

128-Expert Scale

How Llama 4 Maverick delivers 400B capacity at 17B cost

The 128-expert MoE architecture in Llama 4 Maverick is a significant scale-up from Scout's 16 experts. Each token is routed to specialized experts, giving the model access to 400B parameters of knowledge while only activating 17B per forward pass. This design enables frontier-class quality without frontier-class compute requirements.

128 experts vs Scout's 16 for 8x more specialization per token
400B total parameters vs Scout's 109B for deeper knowledge capacity
Same 17B active parameter cost per token as Scout for efficient inference
Each expert develops deep domain specialization during training
Sparse routing ensures optimal expert selection for every input

Start Chatting View benchmarks

Llama 4 Maverick 128-expert MoE architecture

Multimodal

Native image understanding in Llama 4 Maverick

Llama 4 Maverick uses early fusion architecture to process text and images together natively. Visual understanding is built into the model from the ground up, not bolted on as a separate module. This results in seamless reasoning across both modalities with strong performance on visual benchmarks.

73.4% on MMMU multimodal benchmark, surpassing GPT-4o's 69.1%
Early fusion architecture for native multimodal processing without separate pipelines
Analyze screenshots, diagrams, charts, and technical documents with precision
Combine visual analysis with code generation for UI development workflows
Process mixed content documents containing both text and embedded images

Try multimodal chat Learn more

Llama 4 Maverick multimodal capabilities

Coding

Coding and function calling with Llama 4 Maverick

Llama 4 Maverick outperforms GPT-4o on coding benchmarks and includes native function calling for building autonomous agent workflows. Whether you need to generate production code, debug complex issues, or build tool-using agents, the 128-expert architecture provides specialized knowledge across programming languages and frameworks.

43.4% on LiveCodeBench v5, exceeding GPT-4o's 37.0% on the same benchmark
Native function calling enables autonomous agent workflows without fine-tuning
Generate production-ready code across Python, JavaScript, TypeScript, Rust, and more
Debug complex multi-file issues with full context awareness across your codebase
Chain multiple tool calls for end-to-end task automation in agentic applications

Get started

Try Llama 4 Maverick now

Start chatting instantly or download weights for self-hosted deployment.

Chat with Maverick

Try Llama 4 Maverick instantly - no setup required

Model card

Complete technical specifications and benchmarks

Documentation

Integration guides and best practices

Download & deploy

Self-hosted deployment

Download official model weights for deployment on your infrastructure.

Hugging Face

Official Llama 4 Maverick model repository

Ollama

Run locally with Ollama

GitHub

Source code and examples

FAQ

Frequently asked questions about Llama 4 Maverick

Answers to the most common questions about performance, deployment, and practical usage of Llama 4 Maverick.

Does Llama 4 Maverick really beat GPT-4o on benchmarks?

Yes. Llama 4 Maverick outperforms GPT-4o on several key benchmarks. It scores 80.5% on MMLU Pro compared to GPT-4o's 78.4%, 69.8% on GPQA Diamond versus 53.6%, and 43.4% on LiveCodeBench v5 versus 37.0%. On multimodal tasks, it achieves 73.4% on MMMU compared to GPT-4o's 69.1%. These results come from Meta's official evaluations and independent testing.

How many GPUs do you need to run Llama 4 Maverick?

Running Llama 4 Maverick at full precision requires approximately 800 GB of VRAM, which typically means a cluster of 8 or more A100 80 GB GPUs. With INT8 quantization, you can reduce this to around 400 GB (roughly 5 A100 GPUs). INT4 quantization brings it down further to approximately 200 GB. Cloud providers also offer hosted API access if local deployment is not practical for your setup.

What makes the 128 expert architecture special in Llama 4 Maverick?

The 128-expert mixture of experts architecture allows Llama 4 Maverick to store 400B parameters of knowledge while only activating 17B per token during inference. Each expert develops deep specialization during training, so the routing mechanism can select the most relevant experts for each input. This gives the model the knowledge depth of a 400B dense model at a fraction of the compute cost.

Can I use Llama 4 Maverick for commercial projects?

Yes. Llama 4 Maverick is released under the Llama 3.1 compatible license, which permits commercial use. You can build products, deploy services, and fine-tune the model for your specific business needs. The license includes usage thresholds for very large-scale deployments, so review the full terms if your application serves hundreds of millions of monthly active users.

How does Llama 4 Maverick handle image understanding?

Llama 4 Maverick uses early fusion architecture, meaning image understanding is built into the model from the ground up rather than added as a separate vision encoder. It processes text and images in a unified stream, enabling natural reasoning across both modalities. It scores 73.4% on MMMU, demonstrating strong performance on tasks that require understanding charts, diagrams, screenshots, and documents.

What is the best way to access Llama 4 Maverick through an API?

Several cloud providers offer hosted API access to Llama 4 Maverick, including services on AWS, Google Cloud, Azure, and specialized inference platforms like Together AI, Fireworks, and Groq. You can also self-host using frameworks like vLLM or TGI. For quick experimentation, the chat interface on this site runs Llama 4 Maverick as the default model with no setup required.

Llama 4 Family