NVIDIA Nemotron 3 Super: The Complete Guide to NVIDIA's Open Agentic AI Model
The agentic AI market is projected to reach $9.14 billion in 2026, growing at a 40.5% CAGR. At the center of this surge, NVIDIA launched Nemotron 3 Super at GTC 2026 — an open model that delivers 5x higher throughput than its predecessor while doubling accuracy. With 120 billion total parameters and only 12 billion active during inference, it redefines the efficiency-performance tradeoff for enterprise AI.
This guide covers everything you need to know about NVIDIA Nemotron 3 Super: its hybrid architecture, real benchmark numbers, enterprise use cases, and step-by-step deployment options.
Table of Contents
- What Is NVIDIA Nemotron 3 Super?
- Architecture Deep Dive: Mamba-Transformer Meets LatentMoE
- Benchmarks: How Nemotron 3 Super Stacks Up
- The 1-Million-Token Context Window Explained
- Enterprise Use Cases and Early Adopters
- How to Access and Deploy Nemotron 3 Super
- Nemotron 3 Family: Nano vs Super vs Ultra
- Conclusion
- Frequently Asked Questions
What Is NVIDIA Nemotron 3 Super?
NVIDIA Nemotron 3 Super is an open-weight large language model built specifically for agentic AI workloads. NVIDIA announced it on March 11, 2026 at GTC, positioning it as the backbone for enterprise multi-agent systems.
Here are the core specifications:
- Total parameters: 120 billion
- Active parameters: Only 12 billion during inference
- Training data: NVIDIA trained the model on 25 trillion tokens
- Context window: 1 million tokens
- Supported languages: English, French, German, Italian, Japanese, Spanish, and Chinese
- License: NVIDIA Nemotron Open Model License (commercial use permitted)
- Training data cutoff: February 2026 (pre-training: June 2025)
Nemotron 3 Super sits in the middle of NVIDIA's Nemotron 3 family. The smaller Nano model handles lightweight tasks, while the upcoming Ultra model targets deep reasoning workloads. Super occupies the sweet spot — powerful enough for complex multi-agent orchestration, efficient enough to run at scale.
Architecture Deep Dive: Mamba-Transformer Meets LatentMoE
What makes Nemotron 3 Super unique is its fusion of three architectural innovations into a single model. This hybrid design gives it both speed and accuracy advantages over pure Transformer models.
Mamba-2 Layers: Linear-Time Sequence Processing
Traditional Transformers suffer from quadratic compute costs as sequence length grows. Every token must attend to every other token, making million-token contexts prohibitively expensive. Mamba-2 layers solve this with a state-space model (SSM) approach that processes sequences in linear time. This is what enables Nemotron 3 Super to handle 1-million-token contexts efficiently.
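The linear-time idea can be sketched in a few lines. The toy scalar recurrence below is an illustrative stand-in (real Mamba-2 uses learned, input-dependent state updates over vector states), but it shows why cost grows linearly: the state has a fixed size and is updated exactly once per token.

```python
# Toy state-space recurrence: a fixed-size state updated once per token.
# Illustrative stand-in, not Mamba-2's actual parameterization.
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Linear-time pass: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:            # one loop over tokens -> O(n) total work
        h = a * h + b * x   # constant-cost state update
        ys.append(c * h)
    return ys
```

Attention, by contrast, compares every token pair, which is O(n²) in sequence length and is exactly what makes million-token contexts expensive for pure Transformers.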
Transformer Attention: Precision When It Matters
Mamba layers alone cannot handle every task. Information retrieval, complex reasoning, and tasks requiring precise attention over specific tokens still benefit from traditional attention mechanisms. Nemotron 3 Super strategically interleaves Transformer attention layers with Mamba layers, getting the best of both worlds.
LatentMoE: 4x More Experts, Same Compute Budget
Latent Mixture of Experts (LatentMoE) is the standout architectural innovation. In standard MoE architectures, the model routes tokens directly to expert networks. LatentMoE takes a different approach — it projects tokens into a smaller latent dimension before routing. This design choice lets the model pack 4x more experts into the same compute budget, significantly improving accuracy per byte.
According to NVIDIA, LatentMoE delivers stronger generalization than standard MoE at equivalent compute costs.
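A hypothetical sketch of the routing idea (this is not NVIDIA's implementation; the `down_proj`, `routers`, and `experts` names are illustrative): the token is projected into a smaller latent space before routing, so router scores and expert inputs are cheaper to compute, which is what lets more experts fit in a fixed compute budget.

```python
# Hypothetical LatentMoE-style routing sketch. The key move is the
# down-projection to a latent dimension BEFORE scoring experts.
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def latent_moe(x, down_proj, routers, experts, top_k=1):
    z = matvec(down_proj, x)                       # hidden dim -> latent dim
    scores = [sum(r * v for r, v in zip(rt, z)) for rt in routers]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    out = [0.0] * len(z)
    for i in top:                                  # mix the chosen experts
        y = experts[i](z)
        out = [o + v / top_k for o, v in zip(out, y)]
    return out
```

Because every expert now operates on the smaller latent vector, the per-token cost of adding experts drops, which is the intuition behind fitting roughly 4x more experts into the same budget.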
Multi-Token Prediction and NVFP4 Training
The model also uses Multi-Token Prediction (MTP) layers that predict multiple tokens per step, boosting generation speed and output quality simultaneously.
NVIDIA trained Nemotron 3 Super using NVFP4 (4-bit floating point) from the very first gradient — a first in the industry. This native 4-bit training on Blackwell GPUs dramatically reduces memory requirements without meaningful accuracy loss. The result is a model that runs up to 4x faster on Blackwell compared to the previous-generation Hopper platform.
Benchmarks: How Nemotron 3 Super Stacks Up
Nemotron 3 Super posts impressive numbers across multiple benchmarks, particularly in agentic and coding tasks. Here is an honest look at where it excels and where it falls short.
Where It Leads
| Benchmark | Nemotron 3 Super | GPT-OSS-120B / Standing | Category |
|---|---|---|---|
| PinchBench | 85.6% | Best open model | Agentic reasoning |
| DeepResearch Bench | #1 | Best open model | Multi-step research |
| SWE-Bench Verified | 60.47% | 41.90% | Software engineering |
| RULER (1M tokens) | 91.75% | 22.30% | Long-context accuracy |
| AIME 2025 | Class leader | — | Math reasoning |

Throughput Advantage
Speed is where Nemotron 3 Super truly dominates:
- 5x higher throughput vs. previous-generation Nemotron Super
- 2.2x higher throughput vs. GPT-OSS-120B
- 7.5x higher throughput vs. Qwen3.5-122B
- Output speed: approximately 450–484 tokens per second depending on provider
- Speculative decoding: 3.45 tokens accepted per verification step on SPEED-Bench (vs. 2.70 for DeepSeek-R1), enabling up to 3x wall-clock speedups
The hybrid architecture and NVFP4 training drive these gains. For high-volume enterprise workloads, this throughput advantage translates directly into lower cost per query.
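The speculative-decoding numbers above come from an accept/verify loop that, in simplified form, looks like the sketch below. The toy `draft` and `target` callables are placeholders; a production system verifies all draft positions in a single batched forward pass of the target model rather than one call per position.

```python
# Simplified speculative decoding step (toy version; real systems verify
# all draft positions in one batched target-model forward pass).
def speculative_step(prefix, draft, target, k=4):
    """Propose k tokens with the cheap draft model, keep the longest
    prefix the target model agrees with, plus one corrected token."""
    ctx, proposed = list(prefix), []
    for _ in range(k):                   # cheap autoregressive drafting
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    ctx, accepted = list(prefix), []
    for t in proposed:                   # target verifies each position
        v = target(ctx)
        if v == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(v)           # first mismatch: take target's token
            break
    return accepted
```

Accepting ~3.45 draft tokens per verification step means each expensive target pass yields several tokens of output instead of one, which is where the up-to-3x wall-clock speedup comes from.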
Where It Falls Short
Nemotron 3 Super is not the best model in every category. It scores 36 on the Artificial Analysis Intelligence Index, trailing closed models like GPT-5 (57 points) and Claude Opus 4.6 (53 points). On benchmarks like Arena-Hard V2 and GPQA Diamond, it also shows relatively weaker performance.
The critical distinction: Nemotron 3 Super achieves these results as a fully open model. When you compare it against other open models in the same size class, it consistently leads in throughput, coding, and long-context tasks.
The 1-Million-Token Context Window Explained
A 1-million-token context window is not just about reading long documents. For agentic AI, it solves a fundamental problem.
The Goal Drift Problem
Autonomous AI agents executing multi-step workflows face goal drift — the agent gradually loses track of its original objective as the task chain grows. With a short context window, the agent must chunk and summarize information, losing critical details along the way. This leads to errors and task failures.
Nemotron 3 Super's 1-million-token context allows agents to keep their entire workflow state in memory. The agent maintains full context throughout the task, dramatically reducing goal drift and improving task completion rates.
Real-World Scenarios
Software development: A coding agent can load an entire codebase into context at once, enabling end-to-end code generation and debugging without document segmentation.
Financial analysis: An analyst agent can ingest thousands of pages of financial reports in a single pass, cross-referencing data across documents for more consistent analysis.
Cybersecurity: A security orchestration agent can evaluate massive log files and event records holistically, detecting threats faster than agents that process logs in chunks.
RULER benchmark scores confirm this capability is reliable: 96.3% accuracy at 256K tokens, 95.67% at 512K tokens, and 91.75% at 1M tokens.
Enterprise Use Cases and Early Adopters
NVIDIA optimized Nemotron 3 Super specifically for multi-agent systems. Gartner projects that 40% of enterprise applications will embed task-specific AI agents by the end of 2026 — up from less than 5% in 2025. Nemotron 3 Super is positioned to power many of these deployments.
Multi-Agent Orchestration
Enterprise environments increasingly require multiple AI agents working in coordination. Nemotron 3 Super's low-latency inference and high-accuracy tool calling make it well-suited for these scenarios.
Consider an IT helpdesk: separate agents handle ticket classification, root cause diagnosis, and resolution recommendation in parallel. NVIDIA reports significant efficiency gains in these multi-agent configurations, with the model's high-throughput architecture enabling more concurrent agent interactions per GPU.
Early Adopters
Major enterprises are already deploying and customizing Nemotron 3 Super:
- Amdocs: Telecom customer service automation
- Palantir: Cybersecurity and intelligence analytics
- Cadence: Semiconductor design workflows
- Dassault Systèmes: Manufacturing and engineering simulation
- Siemens: Industrial automation processes
Deployment Options
Enterprises can choose from flexible deployment paths:
- Cloud: Google Cloud Vertex AI, Oracle Cloud Infrastructure, Amazon Bedrock (coming soon), Microsoft Azure (coming soon)
- On-premise: Dell AI Factory and HPE Agents Hub for in-house deployment
- Hybrid: NVIDIA NIM container infrastructure for both cloud and local deployment
Because the model is fully open, organizations with strict data privacy and security requirements can run it on their own infrastructure with complete control over their data.
How to Access and Deploy Nemotron 3 Super
You can access Nemotron 3 Super through multiple channels. Choose the approach that fits your needs.
API Access
The fastest way to start is through hosted APIs:
- NVIDIA build.nvidia.com: Try the model directly on NVIDIA's own platform
- DeepInfra: $0.10 per 1M input tokens, $0.50 per 1M output tokens
- OpenRouter: Free tier available alongside paid options
- Nebius: Competitive pricing with high-throughput infrastructure
- Perplexity: Available with Pro subscription
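Most of the providers above expose an OpenAI-compatible chat endpoint. The standard-library sketch below builds such a request; the base URL and model id are placeholders you should check against your provider's documentation, and the actual send is left commented out.

```python
# Hedged sketch of an OpenAI-compatible chat request. The URL and
# model id below are illustrative placeholders, not confirmed values.
import json
import urllib.request

def build_request(prompt, api_key,
                  base_url="https://integrate.api.nvidia.com/v1",  # placeholder
                  model="nvidia/nemotron-3-super"):                # placeholder
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_request("Classify this support ticket ...", "YOUR_API_KEY")
# urllib.request.urlopen(req)  # uncomment with a real key to send
```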
Download Model Weights
You can download weights from Hugging Face in multiple precision formats:
- BF16: Full precision for research and fine-tuning
- FP8: Reduced memory footprint for production deployment
- NVFP4: Smallest memory footprint, optimized for Blackwell GPUs
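A typical pull via the Hugging Face CLI would look like the sketch below. The repository id is a placeholder: check NVIDIA's Hugging Face organization for the exact name and for the suffixes used by the FP8 and NVFP4 variants.

```shell
# Illustrative download; the repo id "nvidia/Nemotron-3-Super" is a
# placeholder, not a confirmed repository name.
pip install -U "huggingface_hub[cli]"
huggingface-cli download nvidia/Nemotron-3-Super --local-dir ./nemotron-3-super
```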
Self-Hosted Deployment
Running Nemotron 3 Super locally requires at least 64 GB of RAM, VRAM, or unified memory. NVIDIA provides ready-to-use cookbooks for three inference engines:
- vLLM: High throughput with continuous batching and streaming support
- SGLang: Lightweight and fast, optimized for multi-agent tool-calling workloads
- TensorRT-LLM: Production-grade low latency with native LatentMoE kernel support
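As one example of the three engines, a local vLLM deployment could be started as sketched below. The model id is a placeholder and the flag values are assumptions; verify both against NVIDIA's vLLM cookbook for the model.

```shell
# Sketch of a local vLLM server; model id and flag values are
# illustrative placeholders, not a confirmed configuration.
pip install vllm
vllm serve nvidia/Nemotron-3-Super \
  --tensor-parallel-size 2 \
  --max-model-len 262144
```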
For a deeper understanding of agentic AI architectures and multi-agent design patterns, check out our guide to building AI agent systems.
Fine-Tuning
NVIDIA released customization cookbooks for LoRA/SFT and GRPO/DAPO-based training. The Unsloth platform also provides step-by-step guides for local fine-tuning. NVIDIA is additionally releasing the complete training recipe — covering pretraining through alignment — so you can reproduce Super's training pipeline or adapt it for domain-specific variants.
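The LoRA technique those cookbooks rely on is simple at its core: instead of updating a full weight matrix W, you train a low-rank pair (A, B) and add their scaled product as a delta. The pure-Python toy below illustrates that idea only; it is not the NVIDIA or Unsloth recipe.

```python
# Toy illustration of the LoRA update, W' = W + (alpha / r) * B @ A,
# where A is r x d_in and B is d_out x r. Not an actual training recipe.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_apply(W, A, B, alpha=16, r=2):
    """Effective weight after merging a trained low-rank adapter."""
    delta = matmul(B, A)        # rank-r update, far fewer parameters than W
    s = alpha / r               # standard LoRA scaling factor
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
```

Because only A and B are trained, the number of trainable parameters drops by orders of magnitude, which is what makes local fine-tuning of a 120B-parameter model tractable.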
Nemotron 3 Family: Nano vs Super vs Ultra
Nemotron 3 is not a single model — it is a family of models optimized for different workload tiers.
| Feature | Nemotron 3 Nano | Nemotron 3 Super | Nemotron 3 Ultra |
|---|---|---|---|
| Total Parameters | 30 billion | 120 billion | ~500 billion |
| Active Parameters | 3 billion | 12 billion | ~50 billion |
| Target Use Case | Lightweight tasks | Multi-agent systems | Deep reasoning |
| Availability | Available now | Available now | Coming H1 2026 |
When to Use Each Model
Nemotron 3 Nano excels at content summarization, software debugging, information retrieval, and AI assistant workflows. It delivers 4x higher throughput than the previous generation and reduces reasoning token generation by up to 60%.
Nemotron 3 Super is built for scenarios where multiple agents coordinate on complex tasks. IT ticket automation, financial analysis, cybersecurity orchestration, and software development are its strong suits.
Nemotron 3 Ultra targets deep research, strategic planning, and advanced reasoning applications. NVIDIA is developing it for workloads that demand the highest accuracy regardless of compute cost.
This tiered design lets you match the right model to the right workload. A simple merge request goes to Nano. A complex full-codebase analysis goes to Super. A multi-day research synthesis goes to Ultra.
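A dispatcher implementing that matching could be as simple as the sketch below. The task labels, token threshold, and model ids are all hypothetical illustrations, not an NVIDIA recommendation.

```python
# Hypothetical tiered-routing sketch; task labels, the 32K threshold,
# and model id strings are illustrative placeholders.
def pick_model(task_type, context_tokens=0):
    if task_type in {"deep-research", "strategic-planning"}:
        return "nemotron-3-ultra"       # highest accuracy, highest cost
    if task_type in {"multi-agent", "codebase-analysis"} or context_tokens > 32_000:
        return "nemotron-3-super"       # orchestration and long context
    return "nemotron-3-nano"            # lightweight default
```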
Conclusion
NVIDIA Nemotron 3 Super marks a turning point for open-source agentic AI. Its hybrid Mamba-Transformer architecture, LatentMoE innovation, and 1-million-token context window create a model that competes with closed alternatives on agentic tasks while remaining fully open.
Four key takeaways:
- Efficiency meets power: 120B parameters with only 12B active means lower cost and higher throughput at scale
- Best-in-class throughput: 2.2x to 7.5x faster than comparable open models, with ~450 tokens per second output speed
- Enterprise-ready: Amdocs, Palantir, and Siemens are already deploying and customizing the model
- Fully open: Weights, datasets, and training recipes are all open under a commercial-friendly license
You can start experimenting with Nemotron 3 Super today on build.nvidia.com. As agentic AI moves from experimental to production, models like Nemotron 3 Super will define the infrastructure layer that makes it possible.
Frequently Asked Questions
What is NVIDIA Nemotron 3 Super?
NVIDIA built Nemotron 3 Super as a 120-billion-parameter open model with 12 billion active parameters during inference. It uses a hybrid Mamba-Transformer LatentMoE architecture designed specifically for agentic AI workloads, excelling at multi-agent orchestration, long-context tasks, and software engineering.
How does LatentMoE differ from traditional Mixture of Experts?
Traditional MoE routes tokens directly to expert networks. LatentMoE first projects tokens into a smaller latent dimension before routing, allowing the model to fit 4x more experts within the same compute budget. This improves accuracy per byte without increasing inference cost.
Is Nemotron 3 Super free to use?
Yes. You can download model weights from Hugging Face under the NVIDIA Nemotron Open Model License, which permits commercial use. Free API access is available through build.nvidia.com and OpenRouter. Paid providers like DeepInfra charge approximately $0.10 per million input tokens.
What hardware do I need to run Nemotron 3 Super locally?
You need at least 64 GB of RAM, VRAM, or unified memory. The FP8 and NVFP4 quantized versions significantly reduce memory requirements. NVIDIA provides ready-to-use deployment cookbooks for vLLM, SGLang, and TensorRT-LLM inference engines.
How does Nemotron 3 Super compare to closed models like GPT-5?
Nemotron 3 Super scores 36 on the Artificial Analysis Intelligence Index, behind closed models like GPT-5 (57) and Claude Opus 4.6 (53). However, it leads all open models in its size class on throughput, SWE-Bench Verified (60.47%), and RULER long-context benchmarks (91.75% at 1M tokens). For agentic workloads prioritizing speed and efficiency, it offers a compelling open alternative.
