NVIDIA Nemotron 3 Super: The Complete Guide to NVIDIA's Open Agentic AI Model
The agentic AI market is projected to reach $9.14 billion in 2026, growing at a 40.5% CAGR. At the center of this surge, NVIDIA launched Nemotron 3 Super at GTC 2026 — an open model that delivers 5x higher throughput than its predecessor while doubling accuracy. With 120 billion total parameters and only 12 billion active during inference, it redefines the efficiency-performance tradeoff for enterprise AI.
This guide covers everything you need to know about NVIDIA Nemotron 3 Super: its hybrid architecture, real benchmark numbers, enterprise use cases, and step-by-step deployment options.
Table of Contents
- What Is NVIDIA Nemotron 3 Super?
- Architecture Deep Dive: Mamba-Transformer Meets LatentMoE
- Benchmarks: How Nemotron 3 Super Stacks Up
- The 1-Million-Token Context Window Explained
- Enterprise Use Cases and Early Adopters
- How to Access and Deploy Nemotron 3 Super
- Nemotron 3 Family: Nano vs Super vs Ultra
- Conclusion
- Frequently Asked Questions
What Is NVIDIA Nemotron 3 Super?
NVIDIA Nemotron 3 Super is an open-weight large language model built specifically for agentic AI workloads. NVIDIA announced it on March 11, 2026 at GTC, positioning it as the backbone for enterprise multi-agent systems.
Here are the core specifications:
- Total parameters: 120 billion
- Active parameters: Only 12 billion during inference
- Training data: NVIDIA trained the model on 25 trillion tokens
- Context window: 1 million tokens
- Supported languages: English, French, German, Italian, Japanese, Spanish, and Chinese
- License: NVIDIA Nemotron Open Model License (commercial use permitted)
- Training data cutoff: February 2026 (pre-training: June 2025)
Nemotron 3 Super sits in the middle of NVIDIA's Nemotron 3 family. The smaller Nano model handles lightweight tasks, while the upcoming Ultra model targets deep reasoning workloads. Super occupies the sweet spot — powerful enough for complex multi-agent orchestration, efficient enough to run at scale.
Architecture Deep Dive: Mamba-Transformer Meets LatentMoE
What makes Nemotron 3 Super unique is its fusion of three architectural innovations into a single model. This hybrid design gives it both speed and accuracy advantages over pure Transformer models.
Mamba-2 Layers: Linear-Time Sequence Processing
Traditional Transformers suffer from quadratic compute costs as sequence length grows. Every token must attend to every other token, making million-token contexts prohibitively expensive. Mamba-2 layers solve this with a state-space model (SSM) approach that processes sequences in linear time. This is what enables Nemotron 3 Super to handle 1-million-token contexts efficiently.
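The linear-time idea can be sketched in a few lines. The toy scalar recurrence below is an illustrative stand-in (real Mamba-2 uses learned, input-dependent state updates over vector states), but it shows why cost grows linearly: the state has a fixed size and is updated exactly once per token.

```python
# Toy state-space recurrence: a fixed-size state updated once per token.
# Illustrative stand-in, not Mamba-2's actual parameterization.
def ssm_scan(xs, a=0.9, b=0.5, c=1.0):
    """Linear-time pass: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:            # one loop over tokens -> O(n) total work
        h = a * h + b * x   # constant-cost state update
        ys.append(c * h)
    return ys
```

Attention, by contrast, compares every token pair, which is O(n²) in sequence length and is exactly what makes million-token contexts expensive for pure Transformers.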
Transformer Attention: Precision When It Matters
Mamba layers alone cannot handle every task. Information retrieval, complex reasoning, and tasks requiring precise attention over specific tokens still benefit from traditional attention mechanisms. Nemotron 3 Super strategically interleaves Transformer attention layers with Mamba layers, getting the best of both worlds.
LatentMoE: 4x More Experts, Same Compute Budget
Latent Mixture of Experts (LatentMoE) is the standout architectural innovation. In standard MoE architectures, the model routes tokens directly to expert networks. LatentMoE takes a different approach — it projects tokens into a smaller latent dimension before routing. This design choice lets the model pack 4x more experts into the same compute budget, significantly improving accuracy per byte.
According to NVIDIA, LatentMoE delivers stronger generalization than standard MoE at equivalent compute costs.
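A hypothetical sketch of the routing idea (this is not NVIDIA's implementation; the `down_proj`, `routers`, and `experts` names are illustrative): the token is projected into a smaller latent space before routing, so router scores and expert inputs are cheaper to compute, which is what lets more experts fit in a fixed compute budget.

```python
# Hypothetical LatentMoE-style routing sketch. The key move is the
# down-projection to a latent dimension BEFORE scoring experts.
def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def latent_moe(x, down_proj, routers, experts, top_k=1):
    z = matvec(down_proj, x)                       # hidden dim -> latent dim
    scores = [sum(r * v for r, v in zip(rt, z)) for rt in routers]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    out = [0.0] * len(z)
    for i in top:                                  # mix the chosen experts
        y = experts[i](z)
        out = [o + v / top_k for o, v in zip(out, y)]
    return out
```

Because every expert now operates on the smaller latent vector, the per-token cost of adding experts drops, which is the intuition behind fitting roughly 4x more experts into the same budget.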
Multi-Token Prediction and NVFP4 Training
The model also uses Multi-Token Prediction (MTP) layers that predict multiple tokens per step, boosting generation speed and output quality simultaneously.
NVIDIA trained Nemotron 3 Super using NVFP4 (4-bit floating point) from the very first gradient — a first in the industry. This native 4-bit training on Blackwell GPUs dramatically reduces memory requirements without meaningful accuracy loss. The result is a model that runs up to 4x faster on Blackwell compared to the previous-generation Hopper platform.
Benchmarks: How Nemotron 3 Super Stacks Up
Nemotron 3 Super posts impressive numbers across multiple benchmarks, particularly in agentic and coding tasks. Here is an honest look at where it excels and where it falls short.
Where It Leads
| Benchmark | Nemotron 3 Super | GPT-OSS-120B / Standing | Category |
|---|---|---|---|
| PinchBench | 85.6% | Best open model | Agentic reasoning |
| DeepResearch Bench | #1 | Best open model | Multi-step research |
| SWE-Bench Verified | 60.47% | 41.90% | Software engineering |
| RULER (1M tokens) | 91.75% | 22.30% | Long-context accuracy |
| AIME 2025 | Class leader | — | Math reasoning |

Throughput Advantage
Speed is where Nemotron 3 Super truly dominates:
- 5x higher throughput vs. previous-generation Nemotron Super
- 2.2x higher throughput vs. GPT-OSS-120B
- 7.5x higher throughput vs. Qwen3.5-122B
- Output speed: approximately 450–484 tokens per second depending on provider
- Speculative decoding: 3.45 tokens accepted per verification step on SPEED-Bench (vs. 2.70 for DeepSeek-R1), enabling up to 3x wall-clock speedups
The hybrid architecture and NVFP4 training drive these gains. For high-volume enterprise workloads, this throughput advantage translates directly into lower cost per query.
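The speculative-decoding numbers above come from an accept/verify loop that, in simplified form, looks like the sketch below. The toy `draft` and `target` callables are placeholders; a production system verifies all draft positions in a single batched forward pass of the target model rather than one call per position.

```python
# Simplified speculative decoding step (toy version; real systems verify
# all draft positions in one batched target-model forward pass).
def speculative_step(prefix, draft, target, k=4):
    """Propose k tokens with the cheap draft model, keep the longest
    prefix the target model agrees with, plus one corrected token."""
    ctx, proposed = list(prefix), []
    for _ in range(k):                   # cheap autoregressive drafting
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    ctx, accepted = list(prefix), []
    for t in proposed:                   # target verifies each position
        v = target(ctx)
        if v == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(v)           # first mismatch: take target's token
            break
    return accepted
```

Accepting ~3.45 draft tokens per verification step means each expensive target pass yields several tokens of output instead of one, which is where the up-to-3x wall-clock speedup comes from.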
Where It Falls Short
Nemotron 3 Super is not the best model in every category. It scores 36 on the Artificial Analysis Intelligence Index, trailing closed models like GPT-5 (57 points) and Claude Opus 4.6 (53 points). On benchmarks like Arena-Hard V2 and GPQA Diamond, it also shows relatively weaker performance.
The critical distinction: Nemotron 3 Super achieves these results as a fully open model. When you compare it against other open models in the same size class, it consistently leads in throughput, coding, and long-context tasks.
The 1-Million-Token Context Window Explained
A 1-million-token context window is not just about reading long documents. For agentic AI, it solves a fundamental problem.
The Goal Drift Problem
Autonomous AI agents executing multi-step workflows face goal drift — the agent gradually loses track of its original objective as the task chain grows. With a short context window, the agent must chunk and summarize information, losing critical details along the way. This leads to errors and task failures.
Nemotron 3 Super's 1-million-token context allows agents to keep their entire workflow state in memory. The agent maintains full context throughout the task, dramatically reducing goal drift and improving task completion rates.
Real-World Scenarios
Software development: A coding agent can load an entire codebase into context at once, enabling end-to-end code generation and debugging without document segmentation.
Financial analysis: An analyst agent can ingest thousands of pages of financial reports in a single pass, cross-referencing data across documents for more consistent analysis.
Cybersecurity: A security orchestration agent can evaluate massive log files and event records holistically, detecting threats faster than agents that process logs in chunks.
RULER benchmark scores confirm this capability is reliable: 96.3% accuracy at 256K tokens, 95.67% at 512K tokens, and 91.75% at 1M tokens.
Enterprise Use Cases and Early Adopters
NVIDIA optimized Nemotron 3 Super specifically for multi-agent systems. Gartner projects that 40% of enterprise applications will embed task-specific AI agents by the end of 2026 — up from less than 5% in 2025. Nemotron 3 Super is positioned to power many of these deployments.
Multi-Agent Orchestration
Enterprise environments increasingly require multiple AI agents working in coordination. Nemotron 3 Super's low-latency inference and high-accuracy tool calling make it well-suited for these scenarios.
Consider an IT helpdesk: separate agents handle ticket classification, root cause diagnosis, and resolution recommendation in parallel. NVIDIA reports significant efficiency gains in these multi-agent configurations, with the model's high-throughput architecture enabling more concurrent agent interactions per GPU.
Early Adopters
Major enterprises are already deploying and customizing Nemotron 3 Super:
- Amdocs: Telecom customer service automation
- Palantir: Cybersecurity and intelligence analytics
- Cadence: Semiconductor design workflows
- Dassault Systèmes: Manufacturing and engineering simulation
- Siemens: Industrial automation processes
Deployment Options
Enterprises can choose from flexible deployment paths:
- Cloud: Google Cloud Vertex AI, Oracle Cloud Infrastructure, Amazon Bedrock (coming soon), Microsoft Azure (coming soon)
- On-premise: Dell AI Factory and HPE Agents Hub for in-house deployment
- Hybrid: NVIDIA NIM container infrastructure for both cloud and local deployment
Because the model is fully open, organizations with strict data privacy and security requirements can run it on their own infrastructure with complete control over their data.
How to Access and Deploy Nemotron 3 Super
You can access Nemotron 3 Super through multiple channels. Choose the approach that fits your needs.
API Access
The fastest way to start is through hosted APIs:
- NVIDIA build.nvidia.com: Try the model directly on NVIDIA's own platform
- DeepInfra: $0.10 per 1M input tokens, $0.50 per 1M output tokens
- OpenRouter: Free tier available alongside paid options
- Nebius: Competitive pricing with high-throughput infrastructure
- Perplexity: Available with Pro subscription
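Most of the providers above expose an OpenAI-compatible chat endpoint. The standard-library sketch below builds such a request; the base URL and model id are placeholders you should check against your provider's documentation, and the actual send is left commented out.

```python
# Hedged sketch of an OpenAI-compatible chat request. The URL and
# model id below are illustrative placeholders, not confirmed values.
import json
import urllib.request

def build_request(prompt, api_key,
                  base_url="https://integrate.api.nvidia.com/v1",  # placeholder
                  model="nvidia/nemotron-3-super"):                # placeholder
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )

req = build_request("Classify this support ticket ...", "YOUR_API_KEY")
# urllib.request.urlopen(req)  # uncomment with a real key to send
```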
Download Model Weights
You can download weights from Hugging Face in multiple precision formats:
- BF16: Full precision for research and fine-tuning
- FP8: Reduced memory footprint for production deployment
- NVFP4: Smallest memory footprint, optimized for Blackwell GPUs
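A typical pull via the Hugging Face CLI would look like the sketch below. The repository id is a placeholder: check NVIDIA's Hugging Face organization for the exact name and for the suffixes used by the FP8 and NVFP4 variants.

```shell
# Illustrative download; the repo id "nvidia/Nemotron-3-Super" is a
# placeholder, not a confirmed repository name.
pip install -U "huggingface_hub[cli]"
huggingface-cli download nvidia/Nemotron-3-Super --local-dir ./nemotron-3-super
```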
Self-Hosted Deployment
Running Nemotron 3 Super locally requires at least 64 GB of RAM, VRAM, or unified memory. NVIDIA provides ready-to-use cookbooks for three inference engines:
- vLLM: High throughput with continuous batching and streaming support
- SGLang: Lightweight and fast, optimized for multi-agent tool-calling workloads
- TensorRT-LLM: Production-grade low latency with native LatentMoE kernel support
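As one example of the three engines, a local vLLM deployment could be started as sketched below. The model id is a placeholder and the flag values are assumptions; verify both against NVIDIA's vLLM cookbook for the model.

```shell
# Sketch of a local vLLM server; model id and flag values are
# illustrative placeholders, not a confirmed configuration.
pip install vllm
vllm serve nvidia/Nemotron-3-Super \
  --tensor-parallel-size 2 \
  --max-model-len 262144
```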
For a deeper understanding of agentic AI architectures and multi-agent design patterns, check out our guide to building AI agent systems.
Fine-Tuning
NVIDIA released customization cookbooks for LoRA/SFT and GRPO/DAPO-based training. The Unsloth platform also provides step-by-step guides for local fine-tuning. NVIDIA is additionally releasing the complete training recipe — covering pretraining through alignment — so you can reproduce Super's training pipeline or adapt it for domain-specific variants.
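The LoRA technique those cookbooks rely on is simple at its core: instead of updating a full weight matrix W, you train a low-rank pair (A, B) and add their scaled product as a delta. The pure-Python toy below illustrates that idea only; it is not the NVIDIA or Unsloth recipe.

```python
# Toy illustration of the LoRA update, W' = W + (alpha / r) * B @ A,
# where A is r x d_in and B is d_out x r. Not an actual training recipe.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_apply(W, A, B, alpha=16, r=2):
    """Effective weight after merging a trained low-rank adapter."""
    delta = matmul(B, A)        # rank-r update, far fewer parameters than W
    s = alpha / r               # standard LoRA scaling factor
    return [[w + s * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
```

Because only A and B are trained, the number of trainable parameters drops by orders of magnitude, which is what makes local fine-tuning of a 120B-parameter model tractable.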
Nemotron 3 Family: Nano vs Super vs Ultra
Nemotron 3 is not a single model — it is a family of models optimized for different workload tiers.
| Feature | Nemotron 3 Nano | Nemotron 3 Super | Nemotron 3 Ultra |
|---|---|---|---|
| Total Parameters | 30 billion | 120 billion | ~500 billion |
| Active Parameters | 3 billion | 12 billion | ~50 billion |
| Target Use Case | Lightweight tasks | Multi-agent systems | Deep reasoning |
| Availability | Available now | Available now | Coming H1 2026 |
When to Use Each Model
Nemotron 3 Nano excels at content summarization, software debugging, information retrieval, and AI assistant workflows. It delivers 4x higher throughput than the previous generation and reduces reasoning token generation by up to 60%.
Nemotron 3 Super is built for scenarios where multiple agents coordinate on complex tasks. IT ticket automation, financial analysis, cybersecurity orchestration, and software development are its strong suits.
Nemotron 3 Ultra targets deep research, strategic planning, and advanced reasoning applications. NVIDIA is developing it for workloads that demand the highest accuracy regardless of compute cost.
This tiered design lets you match the right model to the right workload. A simple merge request goes to Nano. A complex full-codebase analysis goes to Super. A multi-day research synthesis goes to Ultra.
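A dispatcher implementing that matching could be as simple as the sketch below. The task labels, token threshold, and model ids are all hypothetical illustrations, not an NVIDIA recommendation.

```python
# Hypothetical tiered-routing sketch; task labels, the 32K threshold,
# and model id strings are illustrative placeholders.
def pick_model(task_type, context_tokens=0):
    if task_type in {"deep-research", "strategic-planning"}:
        return "nemotron-3-ultra"       # highest accuracy, highest cost
    if task_type in {"multi-agent", "codebase-analysis"} or context_tokens > 32_000:
        return "nemotron-3-super"       # orchestration and long context
    return "nemotron-3-nano"            # lightweight default
```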
Conclusion
NVIDIA Nemotron 3 Super marks a turning point for open-source agentic AI. Its hybrid Mamba-Transformer architecture, LatentMoE innovation, and 1-million-token context window create a model that competes with closed alternatives on agentic tasks while remaining fully open.
Four key takeaways:
- Efficiency meets power: 120B parameters with only 12B active means lower cost and higher throughput at scale
- Best-in-class throughput: 2.2x to 7.5x faster than comparable open models, with ~450 tokens per second output speed
- Enterprise-ready: Amdocs, Palantir, and Siemens are already deploying and customizing the model
- Fully open: Weights, datasets, and training recipes are all open under a commercial-friendly license
You can start experimenting with Nemotron 3 Super today on build.nvidia.com. As agentic AI moves from experimental to production, models like Nemotron 3 Super will define the infrastructure layer that makes it possible.
Frequently Asked Questions
What is NVIDIA Nemotron 3 Super?
NVIDIA built Nemotron 3 Super as a 120-billion-parameter open model with 12 billion active parameters during inference. It uses a hybrid Mamba-Transformer LatentMoE architecture designed specifically for agentic AI workloads, excelling at multi-agent orchestration, long-context tasks, and software engineering.
How does LatentMoE differ from traditional Mixture of Experts?
Traditional MoE routes tokens directly to expert networks. LatentMoE first projects tokens into a smaller latent dimension before routing, allowing the model to fit 4x more experts within the same compute budget. This improves accuracy per byte without increasing inference cost.
Is Nemotron 3 Super free to use?
Yes. You can download model weights from Hugging Face under the NVIDIA Nemotron Open Model License, which permits commercial use. Free API access is available through build.nvidia.com and OpenRouter. Paid providers like DeepInfra charge approximately $0.10 per million input tokens.
What hardware do I need to run Nemotron 3 Super locally?
You need at least 64 GB of RAM, VRAM, or unified memory. The FP8 and NVFP4 quantized versions significantly reduce memory requirements. NVIDIA provides ready-to-use deployment cookbooks for vLLM, SGLang, and TensorRT-LLM inference engines.
How does Nemotron 3 Super compare to closed models like GPT-5?
Nemotron 3 Super scores 36 on the Artificial Analysis Intelligence Index, behind closed models like GPT-5 (57) and Claude Opus 4.6 (53). However, it leads all open models in its size class on throughput, SWE-Bench Verified (60.47%), and RULER long-context benchmarks (91.75% at 1M tokens). For agentic workloads prioritizing speed and efficiency, it offers a compelling open alternative.
