Nvidia's new Nemotron 3 Ultra is not subtle about its intended job. The
company describes it as an open 550-billion-parameter mixture-of-experts model
for long-running agents: systems that plan, call tools, write code, research,
and keep working past the tidy two-minute demo. The useful number is not only
550 billion. It is 55 billion active parameters, the slice doing work for each
token, because Nvidia's pitch is about capability that can still be served
without turning every agent run into a boutique compute event.
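To make the 550-versus-55 point concrete, here is a back-of-envelope sketch. The two parameter counts come from the announcement; the rule of roughly two FLOPs per active parameter per token is a common approximation that I am assuming here, not a figure from Nvidia's report, and it ignores attention and state-space costs that grow with context length.

```python
# Back-of-envelope sketch: what "55B active out of 550B" implies per token.
# The 2-FLOPs-per-active-parameter rule is a common approximation for a
# forward pass, not a figure from Nvidia's report.

TOTAL_PARAMS = 550e9   # total parameters, from the announcement
ACTIVE_PARAMS = 55e9   # parameters active per token, from the announcement


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that take part in each token."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    """Rough forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    moe_flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
    dense_flops = approx_forward_flops_per_token(TOTAL_PARAMS)
    print(f"active fraction: {frac:.0%}")
    print(f"approx FLOPs/token (MoE):   {moe_flops:.2e}")
    print(f"approx FLOPs/token (dense): {dense_flops:.2e}")
    print(f"dense / MoE ratio: {dense_flops / moe_flops:.0f}x")
```

Under that assumption, each token touches about a tenth of the model, which is the arithmetic behind the serving pitch. All 550 billion parameters still have to be stored somewhere, so the saving is in compute per token, not in memory.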
The release landed on June 4 in an Nvidia Developer post by Chris Alexiuk and
Chintan Patel, with the technical report and model card filling in the harder
details. Nemotron 3 Ultra uses a hybrid Mamba-Transformer design, LatentMoE,
multi-token prediction, and NVFP4 quantization. MarkTechPost's summary adds
that the model was pre-trained on 20 trillion tokens and extended to a
one-million-token context window. The technical report gives the fuller title:
"Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic
Reasoning." A title like that doesn't leave much room for mystery.
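The quantization detail is easier to appreciate with a rough storage estimate. The sketch below is my own arithmetic, not a number from the model card: it multiplies the 550B parameter count by a nominal bit width and deliberately ignores quantization scale factors, any layers kept at higher precision, caches, and activations, so real checkpoints will be larger.

```python
# Rough weight-memory estimate. Ignores quantization scale factors, any
# layers kept at higher precision, KV/state caches, and activations, so
# real checkpoints will be larger than these figures.

TOTAL_PARAMS = 550e9  # from the announcement


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at the given bit width."""
    return params * bits_per_weight / 8.0


def to_gb(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_bytes / 1e9


if __name__ == "__main__":
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label}: ~{to_gb(weight_bytes(TOTAL_PARAMS, bits)):,.0f} GB")
```

Even as a lower bound, the gap between roughly 1,100 GB at 16 bits and roughly 275 GB at a nominal 4 bits suggests why a 4-bit format sits in the headline feature list.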
I don't read this as Nvidia trying to become another chatbot company. That
would be too small, and probably the wrong business. The interesting move is
that Nvidia is making the model legible to the machinery around it: NIM
packaging, Hugging Face weights, build.nvidia.com, OpenRouter, Perplexity,
Anaconda, and the usual cloud-adjacent routes by which a research artefact
becomes a thing people can actually run. This is the company that already owns
too much of the physical substrate of AI deciding that the software layer
above the chips should also speak its language.
The open part matters because it changes where trust is supposed to sit.
Nvidia is not asking developers to admire a leaderboard number from outside
the glass. It is offering weights, data, recipes, quantized checkpoints, and
deployment routes, then daring the rest of the stack to form around them. The
Hugging Face card lists the OpenMDW 1.1 license, a June 4 release date, the
550B total and 55B active parameter counts, and the same one-million-token
context length. That is a very specific kind of openness: not a hobbyist toy,
not a sealed API, but an industrial object with enough handles for other
companies to build around.
There is a familiar Nvidia pattern here. In the AlexNet story, two GTX 580
cards helped make a neural-network result suddenly impossible to ignore. The
hardware did not invent deep learning, but it changed what could be repeated
quickly enough to matter. Nemotron 3 Ultra sits much further up the stack, yet
the logic rhymes. Nvidia keeps turning constraints into platforms: memory
limits, throughput, quantization, inference cost, now the long, expensive
drift of agent work.
The risk is that "open" becomes another way of tightening the ecosystem. If
the model, serving layer, quantization format, optimization story, and preferred
deployment surface all line up around Nvidia, developers gain access while the
centre of gravity still moves toward the same vendor. That is not hypocrisy.
It is strategy. A closed API can own the user relationship. An open model can
own the default architecture, especially when the default architecture has to
run somewhere expensive.
Nvidia's supporting video claims up to five times faster inference
and up to 30 percent lower cost. Those numbers need the usual caution, because
vendor launch claims are not neutral field reports. However, they point at the
real argument. Long-running agents are not only a model-quality problem. They
are a patience problem, a scheduling problem, a bill problem, a question of
whether anyone wants to wait while a machine thinks, searches, checks, fails,
and tries again. If Nvidia can make that loop cheaper and more deployable, the
model itself is only part of the release.
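A toy model shows how those launch figures would play out over one long agent run. Everything about the baseline is hypothetical (the `AgentRun` class, the step count, tokens per step, decode speed, and price are mine, chosen only for illustration), and the "up to" factors are applied at their best case. It counts generation time only, with no tool latency or retries.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Toy model of one agent run. All baseline values are hypothetical."""
    steps: int                      # model calls in the run
    tokens_per_step: int            # generated tokens per call
    tokens_per_second: float        # baseline decode speed
    usd_per_million_tokens: float   # baseline price

    @property
    def total_tokens(self) -> int:
        return self.steps * self.tokens_per_step

    def wall_clock_seconds(self, speedup: float = 1.0) -> float:
        """Generation time only; tool calls and retries are not modelled."""
        return self.total_tokens / (self.tokens_per_second * speedup)

    def cost_usd(self, cost_reduction: float = 0.0) -> float:
        """Token cost after applying a fractional reduction (0.3 = 30%)."""
        base = self.total_tokens / 1e6 * self.usd_per_million_tokens
        return base * (1.0 - cost_reduction)


if __name__ == "__main__":
    run = AgentRun(steps=200, tokens_per_step=1500,
                   tokens_per_second=50.0, usd_per_million_tokens=10.0)
    print(f"baseline:  {run.wall_clock_seconds() / 60:.0f} min, "
          f"${run.cost_usd():.2f}")
    # Best case of the vendor's "up to 5x faster, up to 30% cheaper" claim.
    print(f"best case: {run.wall_clock_seconds(speedup=5.0) / 60:.0f} min, "
          f"${run.cost_usd(cost_reduction=0.30):.2f}")
```

With these made-up inputs the run drops from 100 minutes to 20 and from $3.00 to $2.10. The specific numbers mean nothing, but the shape of the result is the point: for a loop that long, the waiting shrinks far more than the bill.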
This also explains why the China-chip story keeps haunting Nvidia's AI news.
I wrote recently about Beijing banning Nvidia's compromised 5090D V2, a
reminder
that hardware access is political before it is technical. Nemotron 3 Ultra is
the other side of the same company: not merely selling scarce accelerators,
but shaping the work those accelerators are meant to do. The agent future,
if it arrives in the form Nvidia wants, won't just need chips. It will need a
stack, a license, a recipe, and somewhere to run.