Llama 4 Scout

10 million tokens of context - the longest window in any open model

Llama 4 Scout redefines what a single model call can accomplish. Built on Meta's mixture of experts architecture with 109B total parameters and only 17B active per token, it delivers the longest context window of any openly available model at 10 million tokens. Feed it an entire codebase spanning hundreds of files, a full research library with dozens of papers, or hours of meeting transcripts. Where other models force you to chunk and summarize, Llama 4 Scout processes everything at once, preserving cross-document relationships and subtle connections that chunking would destroy.

Model variants

Instruction-tuned and base models

Choose between the instruction-tuned variant optimized for chat and long-context tasks, or the base model for fine-tuning and custom applications.

Mixture-of-Experts Architecture

109B total parameters, 17B active per token

Llama 4 Scout uses a sparse MoE design with 16 experts, activating 17B parameters per forward pass. The standout feature is its 10 million token context window - the longest of any openly available model.

Ideal for tasks that require processing massive amounts of text: entire codebases, multi-document analysis, long research papers, and extended conversation histories.

Instruction-tuned

Scout Instruct

Optimized for conversational AI and long-context task completion

Fine-tuned for following instructions, multi-turn dialogue, and processing very long inputs

Available now

Pre-trained

Scout Base

Foundation MoE model for fine-tuning and specialized applications

Pre-trained on diverse multimodal data with 16-expert routing

Available now

Capabilities

What makes Llama 4 Scout a long context powerhouse

Llama 4 Scout combines an unprecedented 10M token context window with MoE efficiency, native multimodal support, and strong reasoning capabilities. Every feature is designed to handle tasks that demand processing large volumes of information in a single pass.

10M token context window

The longest context window of any openly available model. Process entire codebases spanning 50,000 lines across hundreds of files, multi-document research libraries, or hours of conversation in a single call. Needle in a haystack tests confirm 95% retrieval accuracy up to 8 million tokens, with 89% accuracy at the full 10 million token limit.

MoE efficiency

Activates only 17B parameters per token from a 109B pool across 16 experts. This sparse routing strategy delivers strong performance at a fraction of the compute cost of dense models with similar total parameter counts. The result is practical deployment on fewer GPUs than you might expect for a model of this capacity.

Code analysis at scale

Load entire repositories into context for cross-file analysis, dependency tracking, and large-scale refactoring tasks. Llama 4 Scout can trace function calls across modules, identify unused imports, and suggest architectural improvements while seeing the full picture of your codebase simultaneously.

Agentic workflows

Native function calling and tool use support enables autonomous agents without additional fine-tuning. Build workflows that chain multiple tools, query databases, call APIs, and process results in sequence. The extended context window means agents can maintain rich state across many interaction steps.

Multilingual support

Strong performance across multiple languages with cultural context understanding for global applications. Whether you are analyzing documents in English, Chinese, Spanish, or other supported languages, Llama 4 Scout maintains consistent quality and nuanced comprehension across linguistic boundaries.

Native multimodal

Process text and images together with early fusion architecture. Analyze screenshots, diagrams, charts, and documents alongside text without needing separate vision pipelines. The multimodal capability is built into the model from the ground up, enabling seamless reasoning across visual and textual information.

Key highlights

Why the Llama 4 Scout context window matters

A 10M token context window changes what's possible with a single model call.

What you can fit in 10M tokens

  • An entire medium-sized codebase (50K+ lines across hundreds of files)
  • Multiple research papers or an entire book
  • Hours of meeting transcripts or conversation history
  • Complete documentation sets for complex systems
  • 95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests

Technical specs

  • 109B total parameters, 17B active per token
  • 16 experts in MoE architecture
  • 10M token context window
  • Native multimodal (text + image)
  • Llama 3.1 compatible license

Performance

Long-context specialist with competitive reasoning

Llama 4 Scout delivers strong performance across standard benchmarks while offering an unmatched 10M token context window for long-document tasks.

In real-world usage, Llama 4 Scout shines when tasks demand processing large volumes of information. Developers report successfully loading entire GitHub repositories for comprehensive code review, researchers feed complete paper collections for literature synthesis, and legal teams process full contract libraries for clause comparison. While Maverick leads on raw benchmark scores, Scout's 10M context window makes it the clear choice for workflows where seeing everything at once is more valuable than marginal quality gains on short prompts.

Llama 4 Scout performance comparison chart

10M token context window - longest of any open model

95%+ retrieval accuracy up to 8M tokens

17B active parameters from 109B total (16 experts)

Competitive with models 2-3x its active parameter count

Native multimodal support for text and image inputs

Benchmark comparison

Scout vs Maverick and the Llama 4 family

Scout trades some raw benchmark performance for its massive context window advantage.

Benchmark
Llama 4 Scout
16 experts
Featured
Llama 4 Maverick
128 experts
Llama 3.1 70B
Dense
MMLU Pro
Knowledge & reasoning
74.3%80.5%66.4%
GPQA Diamond
Scientific knowledge
57.2%69.8%46.7%
LiveCodeBench v5
Coding
32.8%43.4%28.5%
MMMU
Multimodal
69.4%73.4%-
Context Window
Max tokens
10M1M128K
Total Parameters
Model size
109B400B70B
Active Parameters
Per token
17B17B70B

Data from Meta's official model card and independent evaluations.

Long Context

10M tokens: process entire codebases with Llama 4 Scout

The 10M token context window in Llama 4 Scout is the longest of any openly available model. Load entire repositories, multi-document research sets, or hours of transcripts into a single context for comprehensive analysis without losing information to chunking or summarization.

  • 95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests
  • 89% accuracy at the full 10M token limit for reliable long-range retrieval
  • Process 50K+ lines of code across hundreds of files simultaneously
  • Analyze complete research paper collections without splitting documents
  • Maintain full conversation history across extended multi-turn sessions
Llama 4 Scout MoE architecture

MoE Architecture

How Llama 4 Scout delivers 109B capacity at 17B cost

The 16-expert MoE architecture in Llama 4 Scout activates only 17B parameters per token while maintaining the representational capacity of a much larger model. This makes it practical to deploy on a single node while still delivering strong performance across reasoning, coding, and analysis tasks.

  • 16 experts with 17B active parameters per forward pass for efficient inference
  • Same active parameter count as Maverick at significantly lower total memory
  • Practical for single-node deployment scenarios with fewer GPU requirements
  • Sparse routing ensures each token gets specialized expert attention
  • Lower operational cost compared to dense models with similar total parameters
Llama 4 Scout 10M context window

Multimodal

Multimodal capabilities in Llama 4 Scout

Llama 4 Scout uses early fusion architecture to process text and images together natively. Visual understanding is built into the model from the ground up rather than added as a separate module, enabling seamless reasoning across both modalities within the same massive context window.

  • 69.4% on MMMU multimodal benchmark for strong visual reasoning
  • Early fusion architecture processes images and text in a unified stream
  • Analyze screenshots, diagrams, flowcharts, and technical drawings alongside code
  • Combine visual document analysis with the full 10M token context window
  • No separate vision pipeline needed, reducing deployment complexity

Download & deploy

Self-hosted deployment

Download official model weights for deployment on your infrastructure.

FAQ

Frequently asked questions about Llama 4 Scout

Answers to the most common questions developers and researchers ask about running, deploying, and getting the most out of Llama 4 Scout.

How much VRAM does Llama 4 Scout need to run locally?

Running the full precision version of Llama 4 Scout requires approximately 220 GB of VRAM, which typically means a multi-GPU setup with at least two A100 80 GB cards. Quantized versions can reduce this significantly. INT8 quantization brings the requirement down to around 110 GB, and INT4 quantization can fit on roughly 55 GB, making it accessible on high-end consumer setups with multiple GPUs.

Can Llama 4 Scout process an entire GitHub repository?

Yes. The 10 million token context window in Llama 4 Scout can hold approximately 50,000 lines of code across hundreds of files simultaneously. This means most medium-sized repositories fit entirely within a single context call, enabling cross-file analysis, dependency tracking, and architectural review without chunking or losing context between files.

What is the difference between Llama 4 Scout and Maverick?

Llama 4 Scout is optimized for long-context tasks with its 10M token window and 16 experts (109B total parameters). Maverick prioritizes raw quality with 128 experts and 400B total parameters but has a 1M token context window. Both activate 17B parameters per token. Choose Scout when you need massive context, choose Maverick when you need maximum benchmark performance.

Is Llama 4 Scout free to use commercially?

Yes. Llama 4 Scout is released under the Llama 3.1 compatible license, which permits commercial use. You can deploy it in production applications, build products on top of it, and fine-tune it for your specific needs. The license does include certain usage thresholds for very large-scale deployments, so review the full license terms if your application serves hundreds of millions of users.

How does the 10 million token context window work in Llama 4 Scout?

The 10M token context window allows Llama 4 Scout to accept and process up to 10 million tokens in a single inference call. This is achieved through architectural innovations in positional encoding and attention mechanisms that maintain coherence over extremely long sequences. Needle-in-a-haystack tests show 95% retrieval accuracy up to 8M tokens and 89% at the full 10M limit.

What programming languages does Llama 4 Scout support for code analysis?

Llama 4 Scout supports all major programming languages including Python, JavaScript, TypeScript, Java, C++, Go, Rust, and many more. Its training data covers a broad range of open source repositories. The real advantage is the context window: you can load entire multi-language projects and analyze cross-language interactions, API boundaries, and full-stack architectures in a single call.

Llama 4 Family

Explore the full Llama 4 lineup

Scout is part of Meta's Llama 4 family. Compare it with Maverick and see how it stacks up against other open models.

Llama 4 Maverick

400B MoE flagship with 128 experts

Compare

All Llama 4 Models

Complete family overview

View all

Llama 4 vs Kimi K2.6

Scout/Maverick vs Moonshot's 1T model

Compare

Llama 4 vs Qwen 3.6

Meta vs Alibaba's latest

Compare

Llama 4 vs DeepSeek V4

MoE architecture showdown

Compare

Llama 4 vs MiniMax M2.7

Context vs cost efficiency

Compare

Get started

Ready to try Llama 4 Scout?

Start chatting instantly for free, or download the model for self-hosted deployment. The 10M token context window is waiting.