Llama 4 Scout
10 million tokens of context - the longest window in any open model
Llama 4 Scout redefines what a single model call can accomplish. Built on Meta's mixture of experts architecture with 109B total parameters and only 17B active per token, it delivers the longest context window of any openly available model at 10 million tokens. Feed it an entire codebase spanning hundreds of files, a full research library with dozens of papers, or hours of meeting transcripts. Where other models force you to chunk and summarize, Llama 4 Scout processes everything at once, preserving cross-document relationships and subtle connections that chunking would destroy.
Model variants
Instruction-tuned and base models
Choose between the instruction-tuned variant optimized for chat and long-context tasks, or the base model for fine-tuning and custom applications.
Mixture-of-Experts Architecture
109B total parameters, 17B active per token
Llama 4 Scout uses a sparse MoE design with 16 experts, activating 17B parameters per forward pass. The standout feature is its 10 million token context window - the longest of any openly available model.
Ideal for tasks that require processing massive amounts of text: entire codebases, multi-document analysis, long research papers, and extended conversation histories.
Instruction-tuned
Scout Instruct
Optimized for conversational AI and long-context task completion
Fine-tuned for following instructions, multi-turn dialogue, and processing very long inputs
Pre-trained
Scout Base
Foundation MoE model for fine-tuning and specialized applications
Pre-trained on diverse multimodal data with 16-expert routing
Capabilities
What makes Llama 4 Scout a long context powerhouse
Llama 4 Scout combines an unprecedented 10M token context window with MoE efficiency, native multimodal support, and strong reasoning capabilities. Every feature is designed to handle tasks that demand processing large volumes of information in a single pass.
10M token context window
The longest context window of any openly available model. Process entire codebases spanning 50,000 lines across hundreds of files, multi-document research libraries, or hours of conversation in a single call. Needle in a haystack tests confirm 95% retrieval accuracy up to 8 million tokens, with 89% accuracy at the full 10 million token limit.
MoE efficiency
Activates only 17B parameters per token from a 109B pool across 16 experts. This sparse routing strategy delivers strong performance at a fraction of the compute cost of dense models with similar total parameter counts. The result is practical deployment on fewer GPUs than you might expect for a model of this capacity.
Code analysis at scale
Load entire repositories into context for cross-file analysis, dependency tracking, and large-scale refactoring tasks. Llama 4 Scout can trace function calls across modules, identify unused imports, and suggest architectural improvements while seeing the full picture of your codebase simultaneously.
Agentic workflows
Native function calling and tool use support enables autonomous agents without additional fine-tuning. Build workflows that chain multiple tools, query databases, call APIs, and process results in sequence. The extended context window means agents can maintain rich state across many interaction steps.
Multilingual support
Strong performance across multiple languages with cultural context understanding for global applications. Whether you are analyzing documents in English, Chinese, Spanish, or other supported languages, Llama 4 Scout maintains consistent quality and nuanced comprehension across linguistic boundaries.
Native multimodal
Process text and images together with early fusion architecture. Analyze screenshots, diagrams, charts, and documents alongside text without needing separate vision pipelines. The multimodal capability is built into the model from the ground up, enabling seamless reasoning across visual and textual information.
Key highlights
Why the Llama 4 Scout context window matters
A 10M token context window changes what's possible with a single model call.
What you can fit in 10M tokens
- An entire medium-sized codebase (50K+ lines across hundreds of files)
- Multiple research papers or an entire book
- Hours of meeting transcripts or conversation history
- Complete documentation sets for complex systems
- 95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests
Technical specs
- 109B total parameters, 17B active per token
- 16 experts in MoE architecture
- 10M token context window
- Native multimodal (text + image)
- Llama 3.1 compatible license
Performance
Long-context specialist with competitive reasoning
Llama 4 Scout delivers strong performance across standard benchmarks while offering an unmatched 10M token context window for long-document tasks.
In real-world usage, Llama 4 Scout shines when tasks demand processing large volumes of information. Developers report successfully loading entire GitHub repositories for comprehensive code review, researchers feed complete paper collections for literature synthesis, and legal teams process full contract libraries for clause comparison. While Maverick leads on raw benchmark scores, Scout's 10M context window makes it the clear choice for workflows where seeing everything at once is more valuable than marginal quality gains on short prompts.
10M token context window - longest of any open model
95%+ retrieval accuracy up to 8M tokens
17B active parameters from 109B total (16 experts)
Competitive with models 2-3x its active parameter count
Native multimodal support for text and image inputs
Benchmark comparison
Scout vs Maverick and the Llama 4 family
Scout trades some raw benchmark performance for its massive context window advantage.
| Benchmark | Llama 4 Scout 16 experts Featured | Llama 4 Maverick 128 experts | Llama 3.1 70B Dense |
|---|---|---|---|
MMLU Pro Knowledge & reasoning | 74.3% | 80.5% | 66.4% |
GPQA Diamond Scientific knowledge | 57.2% | 69.8% | 46.7% |
LiveCodeBench v5 Coding | 32.8% | 43.4% | 28.5% |
MMMU Multimodal | 69.4% | 73.4% | - |
Context Window Max tokens | 10M | 1M | 128K |
Total Parameters Model size | 109B | 400B | 70B |
Active Parameters Per token | 17B | 17B | 70B |
Data from Meta's official model card and independent evaluations.
Long Context
10M tokens: process entire codebases with Llama 4 Scout
The 10M token context window in Llama 4 Scout is the longest of any openly available model. Load entire repositories, multi-document research sets, or hours of transcripts into a single context for comprehensive analysis without losing information to chunking or summarization.
- 95%+ retrieval accuracy up to 8M tokens in needle-in-a-haystack tests
- 89% accuracy at the full 10M token limit for reliable long-range retrieval
- Process 50K+ lines of code across hundreds of files simultaneously
- Analyze complete research paper collections without splitting documents
- Maintain full conversation history across extended multi-turn sessions
MoE Architecture
How Llama 4 Scout delivers 109B capacity at 17B cost
The 16-expert MoE architecture in Llama 4 Scout activates only 17B parameters per token while maintaining the representational capacity of a much larger model. This makes it practical to deploy on a single node while still delivering strong performance across reasoning, coding, and analysis tasks.
- 16 experts with 17B active parameters per forward pass for efficient inference
- Same active parameter count as Maverick at significantly lower total memory
- Practical for single-node deployment scenarios with fewer GPU requirements
- Sparse routing ensures each token gets specialized expert attention
- Lower operational cost compared to dense models with similar total parameters
Multimodal
Multimodal capabilities in Llama 4 Scout
Llama 4 Scout uses early fusion architecture to process text and images together natively. Visual understanding is built into the model from the ground up rather than added as a separate module, enabling seamless reasoning across both modalities within the same massive context window.
- 69.4% on MMMU multimodal benchmark for strong visual reasoning
- Early fusion architecture processes images and text in a unified stream
- Analyze screenshots, diagrams, flowcharts, and technical drawings alongside code
- Combine visual document analysis with the full 10M token context window
- No separate vision pipeline needed, reducing deployment complexity
Get started
Try Llama 4 Scout now
Start chatting instantly or download weights for self-hosted deployment.
Download & deploy
Self-hosted deployment
Download official model weights for deployment on your infrastructure.
FAQ
Frequently asked questions about Llama 4 Scout
Answers to the most common questions developers and researchers ask about running, deploying, and getting the most out of Llama 4 Scout.
Running the full precision version of Llama 4 Scout requires approximately 220 GB of VRAM, which typically means a multi-GPU setup with at least two A100 80 GB cards. Quantized versions can reduce this significantly. INT8 quantization brings the requirement down to around 110 GB, and INT4 quantization can fit on roughly 55 GB, making it accessible on high-end consumer setups with multiple GPUs.
Yes. The 10 million token context window in Llama 4 Scout can hold approximately 50,000 lines of code across hundreds of files simultaneously. This means most medium-sized repositories fit entirely within a single context call, enabling cross-file analysis, dependency tracking, and architectural review without chunking or losing context between files.
Llama 4 Scout is optimized for long-context tasks with its 10M token window and 16 experts (109B total parameters). Maverick prioritizes raw quality with 128 experts and 400B total parameters but has a 1M token context window. Both activate 17B parameters per token. Choose Scout when you need massive context, choose Maverick when you need maximum benchmark performance.
Yes. Llama 4 Scout is released under the Llama 3.1 compatible license, which permits commercial use. You can deploy it in production applications, build products on top of it, and fine-tune it for your specific needs. The license does include certain usage thresholds for very large-scale deployments, so review the full license terms if your application serves hundreds of millions of users.
The 10M token context window allows Llama 4 Scout to accept and process up to 10 million tokens in a single inference call. This is achieved through architectural innovations in positional encoding and attention mechanisms that maintain coherence over extremely long sequences. Needle-in-a-haystack tests show 95% retrieval accuracy up to 8M tokens and 89% at the full 10M limit.
Llama 4 Scout supports all major programming languages including Python, JavaScript, TypeScript, Java, C++, Go, Rust, and many more. Its training data covers a broad range of open source repositories. The real advantage is the context window: you can load entire multi-language projects and analyze cross-language interactions, API boundaries, and full-stack architectures in a single call.
Llama 4 Family
Explore the full Llama 4 lineup
Scout is part of Meta's Llama 4 family. Compare it with Maverick and see how it stacks up against other open models.
Get started
Ready to try Llama 4 Scout?
Start chatting instantly for free, or download the model for self-hosted deployment. The 10M token context window is waiting.