Llama 4 Models

Two models, one family - from long context to frontier quality

The Llama 4 models represent Meta's most ambitious open-weight release to date. This family features two mixture of experts architectures designed for different priorities: Scout handles massive context with a 10 million token window across 16 experts, while Maverick delivers frontier-class quality through 128 experts and 400B total parameters. Both share 17B active parameters per token and native multimodal support, giving you the flexibility to choose the right balance of context length and output quality for your specific workflow.

Start Chatting Compare Models

All models

Choose the right option from the Llama 4 models

Scout and Maverick are optimized for different scenarios. Understanding their strengths helps you pick the variant that matches your workload, whether that means processing entire codebases or generating the highest quality reasoning and code.

Llama 4 Scout

10M context window - the long-context specialist

109B total parameters across 16 experts with 17B active per token. The standout feature is its 10 million token context window, the longest of any openly available model. Scout excels when your task requires ingesting large volumes of information at once, from entire repositories to multi-document research collections. Needle-in-a-haystack tests confirm 95% retrieval accuracy up to 8 million tokens.

Choose Scout when you need to process entire codebases, multi-document research sets, or very long conversation histories in a single call. It is the best option when context length matters more than marginal quality differences.

Try Scout Learn more

Llama 4 Maverick

128 experts, 400B parameters - the quality flagship

400B total parameters across 128 experts with 17B active per token. Maverick outperforms GPT-4o on key benchmarks including MMLU Pro, GPQA Diamond, and LiveCodeBench. The 128-expert architecture provides deep specialization across domains, making it the strongest open-weight model available for reasoning, coding, and multimodal tasks. It offers a 1M token context window for most production needs.

Choose Maverick when you need maximum quality for reasoning, coding, multimodal analysis, and complex task completion. It is the default chat model on this site for good reason.

Try Maverick Learn more

Long Context

Llama 4 Scout

109B total, 17B active, 16 experts. 10M token context window.

Best for: entire codebases, multi-document analysis, long research papers, extended conversations.

Available now

Learn more Download

Flagship

Llama 4 Maverick

400B total, 17B active, 128 experts. Beats GPT-4o on benchmarks.

Best for: complex reasoning, code generation, multimodal tasks, research synthesis.

Available now

Learn more Download

Shared capabilities

What all Llama 4 models can do

Scout and Maverick share a common set of capabilities built on Meta's mixture of experts architecture. These shared foundations mean you can switch between the two variants without changing your integration code.

Native multimodal

Both Llama 4 models process text and images natively with early fusion architecture. Visual understanding is built in from the ground up, not added as a separate encoder. This means you can send mixed content, including screenshots, diagrams, and documents alongside text, and get coherent reasoning across both modalities.

MoE efficiency

Both Llama 4 models activate only 17B parameters per token despite their large total parameter counts. Scout uses 16 experts with 109B total, Maverick uses 128 experts with 400B total. This sparse routing strategy delivers strong performance at a fraction of the compute cost of equivalent dense architectures.

Function calling

Built-in function calling across both Llama 4 models enables agentic workflows without additional fine-tuning. Define your tools, and the model will decide when and how to call them. This makes it straightforward to build autonomous agents that query databases, call APIs, execute code, and chain operations together.

Extended context

Scout offers a 10M token context window for extreme long-document tasks, while Maverick provides 1M tokens for most production scenarios. Both far exceed the 128K limit of previous generation models, giving you room to include more context, more examples, and more history in every request.

Multilingual

Strong multilingual support across both Llama 4 models enables global applications. Whether your users communicate in English, Chinese, Spanish, French, or other supported languages, both variants maintain consistent quality with culturally aware responses.

Open weights

Both Llama 4 models are fully open-weight under the Llama 3.1 compatible license. Deploy anywhere, modify freely, and fine-tune for your specific needs. This openness means no vendor lock-in, full transparency into model behavior, and the ability to run entirely on your own infrastructure.

Quick selection guide

Which of the Llama 4 models should you choose?

Match your primary use case to the right variant.

Choose Scout when

You need to process very long documents (10M tokens)
Entire codebase analysis across hundreds of files
Multi-document research and synthesis
Extended conversation histories
Lower memory requirements (109B vs 400B total)

Choose Maverick when

Maximum quality is the priority
Complex reasoning and scientific tasks
Code generation and debugging
Multimodal analysis (screenshots, diagrams)
Tasks where benchmark performance matters most

Start Chatting View all benchmarks

Performance

Complete benchmark comparison across Llama 4 models

Scout optimizes for context length, Maverick for raw quality. Both deliver strong performance relative to their design goals.

Choosing between the Llama 4 models comes down to your primary need. If your workflow involves processing large volumes of text, code, or documents in a single call, Scout's 10M token context window is unmatched. If you need the highest possible quality for reasoning, coding, or multimodal tasks, Maverick's 128-expert architecture delivers frontier-class results that compete with the best proprietary offerings. Many teams use both: Maverick for quality-critical tasks and Scout for large-scale analysis.

Start Chatting View model card

Maverick: 80.5% MMLU Pro, 73.4% MMMU, beats GPT-4o on coding

Scout: 10M token context, 95%+ retrieval at 8M tokens

Both: 17B active parameters, native multimodal, function calling

Both: open-weight under Llama 3.1 compatible license

Full comparison

Scout vs Maverick side by side

Complete benchmark results across reasoning, coding, multimodal, and deployment metrics.

Benchmark	Maverick 128 experts Flagship	Scout 16 experts Long Context
MMLU Pro Knowledge & reasoning	80.5%	74.3%
GPQA Diamond Scientific knowledge	69.8%	57.2%
LiveCodeBench v5 Coding	43.4%	32.8%
MMMU Multimodal	73.4%	69.4%
Context Window Max tokens	1M	10M
Total Parameters Model size	400B	109B
Active Parameters Per token	17B	17B
Number of Experts MoE routing	128	16

Data from Meta's official model card and independent evaluations.

Scout

Llama 4 Scout: when context length is everything

Scout's 10M token context window is unmatched among the Llama 4 models and across the entire open-weight landscape. It can process entire codebases, multi-document research sets, and hours of transcripts in a single call. If your task involves very long inputs, Scout is the clear choice.

10M token context, the longest of any open model available today
95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests
109B total parameters across 16 experts with 17B active per token
Process entire GitHub repositories for comprehensive code review
Ideal for legal document analysis, research synthesis, and audit workflows

Try Scout Scout details

Maverick

Llama 4 Maverick: when quality is the priority

Maverick's 128-expert architecture delivers frontier-class performance that outperforms GPT-4o on key benchmarks. It is the default model on this site for good reason: it handles complex reasoning, coding, and multimodal tasks with the quality you would expect from the best proprietary alternatives.

80.5% MMLU Pro for frontier-class knowledge and reasoning
Outperforms GPT-4o on coding benchmarks with 43.4% on LiveCodeBench v5
400B total parameters across 128 experts for deep domain specialization
73.4% on MMMU for strong multimodal understanding of images and documents
Native function calling for building autonomous agent workflows

Try Maverick Maverick details

Selection Guide

Choosing the right option from the Llama 4 models

Picking between the Llama 4 models depends on what matters most for your specific workflow. Both share the same 17B active parameter footprint and native multimodal support, so the decision comes down to context length versus output quality. Many teams find value in using both variants for different parts of their pipeline.

Pick Scout for tasks that require processing more than 1 million tokens at once
Pick Maverick for tasks where output quality and reasoning depth matter most
Both share 17B active parameters, so inference cost per token is comparable
Use Scout for ingestion and analysis, then Maverick for synthesis and generation
Both run under the same open-weight license, so you can deploy either or both freely

Try now

Start chatting with Llama 4

Try both models instantly through our chat interface.

Start Chatting

Chat with Llama 4 models instantly, no setup required

Model card

Complete technical specifications for both variants

Documentation

Integration guides and best practices

Download

Get model weights

Download official weights for either Llama 4 variant.

Hugging Face

All Llama 4 model repositories

Ollama

Run either variant locally with Ollama

GitHub

Source code and examples

FAQ

Frequently asked questions about Llama 4 models

Answers to the most common questions about choosing, running, and deploying the Llama 4 models for your projects.

How many Llama 4 models are available right now?

There are currently two Llama 4 models: Scout and Maverick. Each comes in two variants, an instruction-tuned version optimized for chat and task completion, and a base pre-trained version for fine-tuning and research. That gives you four total checkpoints to choose from depending on whether you need a ready-to-use conversational model or a foundation for custom training.

Which Llama 4 model is best for coding tasks?

Maverick is the stronger choice for coding tasks. It scores 43.4% on LiveCodeBench v5, outperforming both Scout (32.8%) and GPT-4o (37.0%). The 128-expert architecture provides deep specialization across programming languages and frameworks. However, if you need to analyze an entire large codebase at once, Scout's 10M token context window lets you load everything into a single call for cross-file analysis.

Can I run any Llama 4 model on a consumer GPU?

Running the full versions requires multi-GPU setups. Scout needs approximately 220 GB of VRAM at full precision, and Maverick needs around 800 GB. However, quantized versions significantly reduce these requirements. Scout with INT4 quantization can fit on roughly 55 GB, which is achievable with high-end consumer GPUs. Maverick with INT4 still needs around 200 GB, making it more suited to cloud or enterprise hardware.

What is the difference between Scout and Maverick in the Llama 4 family?

Scout is optimized for long-context tasks with a 10M token window and 16 experts (109B total parameters). Maverick prioritizes output quality with 128 experts and 400B total parameters but has a 1M token context window. Both activate 17B parameters per token. Think of Scout as the wide-angle lens and Maverick as the high-resolution lens in the same camera system.

Are all Llama 4 models free and open weight?

Yes. All Llama 4 models are released under the Llama 3.1 compatible license, which permits commercial use, fine-tuning, and redistribution. You can deploy them on your own infrastructure, build products on top of them, and modify the weights for your specific needs. The license includes usage thresholds for very large-scale deployments serving hundreds of millions of users.

Which Llama 4 model should I choose for document analysis?

It depends on the volume and complexity of your documents. For analyzing large collections of documents, contracts, or research papers in a single pass, Scout's 10M token context window is ideal. For shorter documents where you need the highest quality extraction, summarization, or reasoning, Maverick's 128-expert architecture produces more nuanced and accurate results. Both support native image understanding for documents with charts, tables, and diagrams.

Llama 4 Family