Llama 4 Models
Two models, one family - from long context to frontier quality
The Llama 4 models represent Meta's most ambitious open-weight release to date. This family features two mixture of experts architectures designed for different priorities: Scout handles massive context with a 10 million token window across 16 experts, while Maverick delivers frontier-class quality through 128 experts and 400B total parameters. Both share 17B active parameters per token and native multimodal support, giving you the flexibility to choose the right balance of context length and output quality for your specific workflow.
All models
Choose the right option from the Llama 4 models
Scout and Maverick are optimized for different scenarios. Understanding their strengths helps you pick the variant that matches your workload, whether that means processing entire codebases or generating the highest quality reasoning and code.
Llama 4 Scout
10M context window - the long-context specialist
109B total parameters across 16 experts with 17B active per token. The standout feature is its 10 million token context window, the longest of any openly available model. Scout excels when your task requires ingesting large volumes of information at once, from entire repositories to multi-document research collections. Needle-in-a-haystack tests confirm 95% retrieval accuracy up to 8 million tokens.
Choose Scout when you need to process entire codebases, multi-document research sets, or very long conversation histories in a single call. It is the best option when context length matters more than marginal quality differences.
Llama 4 Maverick
128 experts, 400B parameters - the quality flagship
400B total parameters across 128 experts with 17B active per token. Maverick outperforms GPT-4o on key benchmarks including MMLU Pro, GPQA Diamond, and LiveCodeBench. The 128-expert architecture provides deep specialization across domains, making it the strongest open-weight model available for reasoning, coding, and multimodal tasks. It offers a 1M token context window for most production needs.
Choose Maverick when you need maximum quality for reasoning, coding, multimodal analysis, and complex task completion. It is the default chat model on this site for good reason.
Long Context
Llama 4 Scout
109B total, 17B active, 16 experts. 10M token context window.
Best for: entire codebases, multi-document analysis, long research papers, extended conversations.
Flagship
Llama 4 Maverick
400B total, 17B active, 128 experts. Beats GPT-4o on benchmarks.
Best for: complex reasoning, code generation, multimodal tasks, research synthesis.
Shared capabilities
What all Llama 4 models can do
Scout and Maverick share a common set of capabilities built on Meta's mixture of experts architecture. These shared foundations mean you can switch between the two variants without changing your integration code.
Native multimodal
Both Llama 4 models process text and images natively with early fusion architecture. Visual understanding is built in from the ground up, not added as a separate encoder. This means you can send mixed content, including screenshots, diagrams, and documents alongside text, and get coherent reasoning across both modalities.
MoE efficiency
Both Llama 4 models activate only 17B parameters per token despite their large total parameter counts. Scout uses 16 experts with 109B total, Maverick uses 128 experts with 400B total. This sparse routing strategy delivers strong performance at a fraction of the compute cost of equivalent dense architectures.
Function calling
Built-in function calling across both Llama 4 models enables agentic workflows without additional fine-tuning. Define your tools, and the model will decide when and how to call them. This makes it straightforward to build autonomous agents that query databases, call APIs, execute code, and chain operations together.
Extended context
Scout offers a 10M token context window for extreme long-document tasks, while Maverick provides 1M tokens for most production scenarios. Both far exceed the 128K limit of previous generation models, giving you room to include more context, more examples, and more history in every request.
Multilingual
Strong multilingual support across both Llama 4 models enables global applications. Whether your users communicate in English, Chinese, Spanish, French, or other supported languages, both variants maintain consistent quality with culturally aware responses.
Open weights
Both Llama 4 models are fully open-weight under the Llama 3.1 compatible license. Deploy anywhere, modify freely, and fine-tune for your specific needs. This openness means no vendor lock-in, full transparency into model behavior, and the ability to run entirely on your own infrastructure.
Quick selection guide
Which of the Llama 4 models should you choose?
Match your primary use case to the right variant.
Choose Scout when
- You need to process very long documents (10M tokens)
- Entire codebase analysis across hundreds of files
- Multi-document research and synthesis
- Extended conversation histories
- Lower memory requirements (109B vs 400B total)
Choose Maverick when
- Maximum quality is the priority
- Complex reasoning and scientific tasks
- Code generation and debugging
- Multimodal analysis (screenshots, diagrams)
- Tasks where benchmark performance matters most
Performance
Complete benchmark comparison across Llama 4 models
Scout optimizes for context length, Maverick for raw quality. Both deliver strong performance relative to their design goals.
Choosing between the Llama 4 models comes down to your primary need. If your workflow involves processing large volumes of text, code, or documents in a single call, Scout's 10M token context window is unmatched. If you need the highest possible quality for reasoning, coding, or multimodal tasks, Maverick's 128-expert architecture delivers frontier-class results that compete with the best proprietary offerings. Many teams use both: Maverick for quality-critical tasks and Scout for large-scale analysis.
Maverick: 80.5% MMLU Pro, 73.4% MMMU, beats GPT-4o on coding
Scout: 10M token context, 95%+ retrieval at 8M tokens
Both: 17B active parameters, native multimodal, function calling
Both: open-weight under Llama 3.1 compatible license
Full comparison
Scout vs Maverick side by side
Complete benchmark results across reasoning, coding, multimodal, and deployment metrics.
| Benchmark | Maverick 128 experts Flagship | Scout 16 experts Long Context |
|---|---|---|
MMLU Pro Knowledge & reasoning | 80.5% | 74.3% |
GPQA Diamond Scientific knowledge | 69.8% | 57.2% |
LiveCodeBench v5 Coding | 43.4% | 32.8% |
MMMU Multimodal | 73.4% | 69.4% |
Context Window Max tokens | 1M | 10M |
Total Parameters Model size | 400B | 109B |
Active Parameters Per token | 17B | 17B |
Number of Experts MoE routing | 128 | 16 |
Data from Meta's official model card and independent evaluations.
Scout
Llama 4 Scout: when context length is everything
Scout's 10M token context window is unmatched among the Llama 4 models and across the entire open-weight landscape. It can process entire codebases, multi-document research sets, and hours of transcripts in a single call. If your task involves very long inputs, Scout is the clear choice.
- 10M token context, the longest of any open model available today
- 95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests
- 109B total parameters across 16 experts with 17B active per token
- Process entire GitHub repositories for comprehensive code review
- Ideal for legal document analysis, research synthesis, and audit workflows
Maverick
Llama 4 Maverick: when quality is the priority
Maverick's 128-expert architecture delivers frontier-class performance that outperforms GPT-4o on key benchmarks. It is the default model on this site for good reason: it handles complex reasoning, coding, and multimodal tasks with the quality you would expect from the best proprietary alternatives.
- 80.5% MMLU Pro for frontier-class knowledge and reasoning
- Outperforms GPT-4o on coding benchmarks with 43.4% on LiveCodeBench v5
- 400B total parameters across 128 experts for deep domain specialization
- 73.4% on MMMU for strong multimodal understanding of images and documents
- Native function calling for building autonomous agent workflows
Selection Guide
Choosing the right option from the Llama 4 models
Picking between the Llama 4 models depends on what matters most for your specific workflow. Both share the same 17B active parameter footprint and native multimodal support, so the decision comes down to context length versus output quality. Many teams find value in using both variants for different parts of their pipeline.
- Pick Scout for tasks that require processing more than 1 million tokens at once
- Pick Maverick for tasks where output quality and reasoning depth matter most
- Both share 17B active parameters, so inference cost per token is comparable
- Use Scout for ingestion and analysis, then Maverick for synthesis and generation
- Both run under the same open-weight license, so you can deploy either or both freely
Try now
Start chatting with Llama 4
Try both models instantly through our chat interface.
Download
Get model weights
Download official weights for either Llama 4 variant.
FAQ
Frequently asked questions about Llama 4 models
Answers to the most common questions about choosing, running, and deploying the Llama 4 models for your projects.
There are currently two Llama 4 models: Scout and Maverick. Each comes in two variants, an instruction-tuned version optimized for chat and task completion, and a base pre-trained version for fine-tuning and research. That gives you four total checkpoints to choose from depending on whether you need a ready-to-use conversational model or a foundation for custom training.
Maverick is the stronger choice for coding tasks. It scores 43.4% on LiveCodeBench v5, outperforming both Scout (32.8%) and GPT-4o (37.0%). The 128-expert architecture provides deep specialization across programming languages and frameworks. However, if you need to analyze an entire large codebase at once, Scout's 10M token context window lets you load everything into a single call for cross-file analysis.
Running the full versions requires multi-GPU setups. Scout needs approximately 220 GB of VRAM at full precision, and Maverick needs around 800 GB. However, quantized versions significantly reduce these requirements. Scout with INT4 quantization can fit on roughly 55 GB, which is achievable with high-end consumer GPUs. Maverick with INT4 still needs around 200 GB, making it more suited to cloud or enterprise hardware.
Scout is optimized for long-context tasks with a 10M token window and 16 experts (109B total parameters). Maverick prioritizes output quality with 128 experts and 400B total parameters but has a 1M token context window. Both activate 17B parameters per token. Think of Scout as the wide-angle lens and Maverick as the high-resolution lens in the same camera system.
Yes. All Llama 4 models are released under the Llama 3.1 compatible license, which permits commercial use, fine-tuning, and redistribution. You can deploy them on your own infrastructure, build products on top of them, and modify the weights for your specific needs. The license includes usage thresholds for very large-scale deployments serving hundreds of millions of users.
It depends on the volume and complexity of your documents. For analyzing large collections of documents, contracts, or research papers in a single pass, Scout's 10M token context window is ideal. For shorter documents where you need the highest quality extraction, summarization, or reasoning, Maverick's 128-expert architecture produces more nuanced and accurate results. Both support native image understanding for documents with charts, tables, and diagrams.
Llama 4 Family
Explore each model and compare with competitors
Dive deeper into each variant or see how the Llama 4 models compare against other frontier open models.
Get started
Find your ideal option among the Llama 4 models
Start chatting with either variant for free, or download weights for local deployment. Both are open-weight and ready to use.