Adaptive Inference in NVIDIA TensorRT for RTX Enables Automatic Optimization
Deploying AI applications across diverse consumer hardware has traditionally forced a trade-off. You can optimize for specific GPU configurations and achieve peak performance at the cost of portability. Alternatively, you can build generic, portable engines and leave performance on the table. Bridging this gap often requires manual tuning, multiple build targets, or accepting compromises.
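Adaptive inference in TensorRT for RTX is designed to close that gap: you build a single, hardware-agnostic engine once, and the library specializes it for whatever RTX GPU it lands on at deployment time. The sketch below illustrates that build-once, deploy-anywhere flow. It assumes the TensorRT for RTX Python bindings mirror the standard TensorRT builder/runtime pattern; the module name `tensorrt_rtx`, the ONNX input, and the file paths are illustrative assumptions, not the SDK's confirmed API.

```python
# Minimal sketch of the build-once, deploy-anywhere flow.
# Assumption: the TensorRT for RTX Python bindings (module name assumed
# here as tensorrt_rtx) mirror the standard TensorRT builder/runtime API.
import tensorrt_rtx as trt  # assumed module name

logger = trt.Logger(trt.Logger.WARNING)

# --- Ahead of time (developer machine): build one portable engine ---
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:  # placeholder model path
    if not parser.parse(f.read()):
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:  # ships unchanged to every user
    f.write(engine_bytes)

# --- At load time (end-user machine): deserialize the portable engine;
# specialization for the local RTX GPU happens automatically ---
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
```

The point of the pattern is that, unlike classic per-GPU engine builds, the serialized engine stays hardware-agnostic until it is loaded, so one artifact can ship to every supported RTX configuration without a rebuild step.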