Llama 4 Scout

10 million tokens of context - the longest window in any open model

Llama 4 Scout redefines what a single model call can accomplish. Built on Meta's mixture of experts architecture with 109B total parameters and only 17B active per token, it delivers the longest context window of any openly available model at 10 million tokens. Feed it an entire codebase spanning hundreds of files, a full research library with dozens of papers, or hours of meeting transcripts. Where other models force you to chunk and summarize, Llama 4 Scout processes everything at once, preserving cross-document relationships and subtle connections that chunking would destroy.

Start Chatting View benchmarks

Model variants

Instruction-tuned and base models

Choose between the instruction-tuned variant optimized for chat and long-context tasks, or the base model for fine-tuning and custom applications.

Mixture-of-Experts Architecture

109B total parameters, 17B active per token

Llama 4 Scout uses a sparse MoE design with 16 experts, activating 17B parameters per forward pass. The standout feature is its 10 million token context window - the longest of any openly available model.

Ideal for tasks that require processing massive amounts of text: entire codebases, multi-document analysis, long research papers, and extended conversation histories.

Start Chatting See capabilities

Instruction-tuned

Scout Instruct

Optimized for conversational AI and long-context task completion

Fine-tuned for following instructions, multi-turn dialogue, and processing very long inputs

Available now

Start Chatting Download weights

Pre-trained

Scout Base

Foundation MoE model for fine-tuning and specialized applications

Pre-trained on diverse multimodal data with 16-expert routing

Available now

View on HuggingFace Documentation

Capabilities

What makes Llama 4 Scout a long context powerhouse

Llama 4 Scout combines an unprecedented 10M token context window with MoE efficiency, native multimodal support, and strong reasoning capabilities. Every feature is designed to handle tasks that demand processing large volumes of information in a single pass.

10M token context window

The longest context window of any openly available model. Process entire codebases spanning 50,000 lines across hundreds of files, multi-document research libraries, or hours of conversation in a single call. Needle in a haystack tests confirm 95% retrieval accuracy up to 8 million tokens, with 89% accuracy at the full 10 million token limit.

MoE efficiency

Activates only 17B parameters per token from a 109B pool across 16 experts. This sparse routing strategy delivers strong performance at a fraction of the compute cost of dense models with similar total parameter counts. The result is practical deployment on fewer GPUs than you might expect for a model of this capacity.

Code analysis at scale

Load entire repositories into context for cross-file analysis, dependency tracking, and large-scale refactoring tasks. Llama 4 Scout can trace function calls across modules, identify unused imports, and suggest architectural improvements while seeing the full picture of your codebase simultaneously.

Agentic workflows

Native function calling and tool use support enables autonomous agents without additional fine-tuning. Build workflows that chain multiple tools, query databases, call APIs, and process results in sequence. The extended context window means agents can maintain rich state across many interaction steps.

Multilingual support

Strong performance across multiple languages with cultural context understanding for global applications. Whether you are analyzing documents in English, Chinese, Spanish, or other supported languages, Llama 4 Scout maintains consistent quality and nuanced comprehension across linguistic boundaries.

Native multimodal

Process text and images together with early fusion architecture. Analyze screenshots, diagrams, charts, and documents alongside text without needing separate vision pipelines. The multimodal capability is built into the model from the ground up, enabling seamless reasoning across visual and textual information.

Key highlights

Why the Llama 4 Scout context window matters

A 10M token context window changes what's possible with a single model call.

What you can fit in 10M tokens

An entire medium-sized codebase (50K+ lines across hundreds of files)
Multiple research papers or an entire book
Hours of meeting transcripts or conversation history
Complete documentation sets for complex systems
95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests

Technical specs

109B total parameters, 17B active per token
16 experts in MoE architecture
10M token context window
Native multimodal (text + image)
Llama 3.1 compatible license

Start Free Chat Download weights

Performance

Long-context specialist with competitive reasoning

Llama 4 Scout delivers strong performance across standard benchmarks while offering an unmatched 10M token context window for long-document tasks.

In real-world usage, Llama 4 Scout shines when tasks demand processing large volumes of information. Developers report successfully loading entire GitHub repositories for comprehensive code review, researchers feed complete paper collections for literature synthesis, and legal teams process full contract libraries for clause comparison. While Maverick leads on raw benchmark scores, Scout's 10M context window makes it the clear choice for workflows where seeing everything at once is more valuable than marginal quality gains on short prompts.

Start Chatting View model card

Llama 4 Scout performance comparison chart

10M token context window - longest of any open model

95%+ retrieval accuracy up to 8M tokens

17B active parameters from 109B total (16 experts)

Competitive with models 2-3x its active parameter count

Native multimodal support for text and image inputs

Benchmark comparison

Scout vs Maverick and the Llama 4 family

Scout trades some raw benchmark performance for its massive context window advantage.

Benchmark	Llama 4 Scout 16 experts Featured	Llama 4 Maverick 128 experts	Llama 3.1 70B Dense
MMLU Pro Knowledge & reasoning	74.3%	80.5%	66.4%
GPQA Diamond Scientific knowledge	57.2%	69.8%	46.7%
LiveCodeBench v5 Coding	32.8%	43.4%	28.5%
MMMU Multimodal	69.4%	73.4%	-
Context Window Max tokens	10M	1M	128K
Total Parameters Model size	109B	400B	70B
Active Parameters Per token	17B	17B	70B

Data from Meta's official model card and independent evaluations.

Long Context

10M tokens: process entire codebases with Llama 4 Scout

The 10M token context window in Llama 4 Scout is the longest of any openly available model. Load entire repositories, multi-document research sets, or hours of transcripts into a single context for comprehensive analysis without losing information to chunking or summarization.

95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests
89% accuracy at the full 10M token limit for reliable long-range retrieval
Process 50K+ lines of code across hundreds of files simultaneously
Analyze complete research paper collections without splitting documents
Maintain full conversation history across extended multi-turn sessions

Try long-context tasks View benchmarks

MoE Architecture

How Llama 4 Scout delivers 109B capacity at 17B cost

The 16-expert MoE architecture in Llama 4 Scout activates only 17B parameters per token while maintaining the representational capacity of a much larger model. This makes it practical to deploy on a single node while still delivering strong performance across reasoning, coding, and analysis tasks.

16 experts with 17B active parameters per forward pass for efficient inference
Same active parameter count as Maverick at significantly lower total memory
Practical for single-node deployment scenarios with fewer GPU requirements
Sparse routing ensures each token gets specialized expert attention
Lower operational cost compared to dense models with similar total parameters

Start Chatting Compare with Maverick

Multimodal

Multimodal capabilities in Llama 4 Scout

Llama 4 Scout uses early fusion architecture to process text and images together natively. Visual understanding is built into the model from the ground up rather than added as a separate module, enabling seamless reasoning across both modalities within the same massive context window.

69.4% on MMMU multimodal benchmark for strong visual reasoning
Early fusion architecture processes images and text in a unified stream
Analyze screenshots, diagrams, flowcharts, and technical drawings alongside code
Combine visual document analysis with the full 10M token context window
No separate vision pipeline needed, reducing deployment complexity

Get started

Try Llama 4 Scout now

Start chatting instantly or download weights for self-hosted deployment.

Chat with Scout

Try Llama 4 Scout instantly - no setup required

Model card

Complete technical specifications and benchmarks

Documentation

Integration guides and best practices

Download & deploy

Self-hosted deployment

Download official model weights for deployment on your infrastructure.

Hugging Face

Official Llama 4 Scout model repository

Ollama

Run locally with Ollama

GitHub

Source code and examples

FAQ

Frequently asked questions about Llama 4 Scout

Answers to the most common questions developers and researchers ask about running, deploying, and getting the most out of Llama 4 Scout.

How much VRAM does Llama 4 Scout need to run locally?

Running the full precision version of Llama 4 Scout requires approximately 220 GB of VRAM, which typically means a multi-GPU setup with at least two A100 80 GB cards. Quantized versions can reduce this significantly. INT8 quantization brings the requirement down to around 110 GB, and INT4 quantization can fit on roughly 55 GB, making it accessible on high-end consumer setups with multiple GPUs.

Can Llama 4 Scout process an entire GitHub repository?

Yes. The 10 million token context window in Llama 4 Scout can hold approximately 50,000 lines of code across hundreds of files simultaneously. This means most medium-sized repositories fit entirely within a single context call, enabling cross-file analysis, dependency tracking, and architectural review without chunking or losing context between files.

What is the difference between Llama 4 Scout and Maverick?

Llama 4 Scout is optimized for long-context tasks with its 10M token window and 16 experts (109B total parameters). Maverick prioritizes raw quality with 128 experts and 400B total parameters but has a 1M token context window. Both activate 17B parameters per token. Choose Scout when you need massive context, choose Maverick when you need maximum benchmark performance.

Is Llama 4 Scout free to use commercially?

Yes. Llama 4 Scout is released under the Llama 3.1 compatible license, which permits commercial use. You can deploy it in production applications, build products on top of it, and fine-tune it for your specific needs. The license does include certain usage thresholds for very large-scale deployments, so review the full license terms if your application serves hundreds of millions of users.

How does the 10 million token context window work in Llama 4 Scout?

The 10M token context window allows Llama 4 Scout to accept and process up to 10 million tokens in a single inference call. This is achieved through architectural innovations in positional encoding and attention mechanisms that maintain coherence over extremely long sequences. Needle-in-a-haystack tests show 95% retrieval accuracy up to 8M tokens and 89% at the full 10M limit.

What programming languages does Llama 4 Scout support for code analysis?

Llama 4 Scout supports all major programming languages including Python, JavaScript, TypeScript, Java, C++, Go, Rust, and many more. Its training data covers a broad range of open source repositories. The real advantage is the context window: you can load entire multi-language projects and analyze cross-language interactions, API boundaries, and full-stack architectures in a single call.

Llama 4 Family