Adaptive Inference in NVIDIA TensorRT for RTX Enables Automatic Optimization

Deploying AI applications across diverse consumer hardware has traditionally forced a trade-off. You can optimize for specific GPU configurations and achieve peak performance at the cost of portability. Alternatively, you can build generic, portable engines and leave performance on the table. Bridging this gap often requires manual tuning, multiple build targets, or accepting compromises.
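
The adaptive approach sidesteps the trade-off by shipping one portable engine and specializing it on the user's machine. Below is a minimal Python sketch of that idea, selecting a kernel configuration for whatever GPU is detected at run time; the names and the configuration table are hypothetical illustrations, not the TensorRT for RTX API.

```python
from dataclasses import dataclass

# Hypothetical sketch of runtime specialization: one portable artifact
# ships with candidate configurations, and the best match is chosen on
# the user's GPU. Not the TensorRT for RTX API.
@dataclass
class KernelConfig:
    precision: str
    tile_size: int

CANDIDATES = {
    "ada":     KernelConfig(precision="fp8",  tile_size=128),
    "ampere":  KernelConfig(precision="fp16", tile_size=64),
    "default": KernelConfig(precision="fp16", tile_size=32),
}

def specialize(detected_arch: str) -> KernelConfig:
    """Pick the configuration for the architecture found at run time."""
    return CANDIDATES.get(detected_arch, CANDIDATES["default"])

print(specialize("ada"))     # hardware-specific peak configuration
print(specialize("turing"))  # portable fallback on anything else
```

The point of the split is that the portable artifact never has to guess the target GPU at build time; specialization happens where the hardware is actually known.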

Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM

Large language models (LLMs) and multimodal reasoning systems are rapidly expanding beyond the data center. Automotive and robotics developers increasingly want to run conversational AI agents, multimodal perception, and high-level planning directly on the vehicle or robot, where latency, reliability, and the ability to operate offline matter most. While many existing LLM and vision language…

Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT-LLM

For machine learning engineers deploying LLMs at scale, the equation is familiar and unforgiving: as context length increases, attention computation costs explode. Whether you're dealing with retrieval-augmented generation (RAG) pipelines, agentic AI workflows, or long-form content generation, the quadratic complexity of attention remains a primary bottleneck. This post explains a technique known as…
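
The intuition that makes skipping viable: in a softmax, a logit that sits far below the running maximum contributes exp(s - m) ≈ 0, so whole key/value blocks below a threshold can be dropped with negligible error. Here is a minimal NumPy sketch of that principle using a simple block-max criterion; it is an illustration of the general idea, not the TensorRT-LLM kernel.

```python
import numpy as np

def skip_softmax_attention(q, K, V, block=64, threshold=10.0):
    """Single-query attention that skips low-contribution KV blocks.

    A block is skipped when its maximum logit lies more than
    `threshold` below the running maximum, since exp(s - m) is then
    negligible. The block-max criterion is illustrative only.
    """
    m = -np.inf                   # running max logit
    num = np.zeros(V.shape[-1])   # weighted-value accumulator
    den = 0.0                     # softmax denominator
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q   # logits for this block
        bmax = s.max()
        if bmax < m - threshold:
            continue                     # negligible block: skip it
        new_m = max(m, bmax)
        scale = np.exp(m - new_m)        # rescale old accumulators
        w = np.exp(s - new_m)
        num = num * scale + w @ V[start:start + block]
        den = den * scale + w.sum()
        m = new_m
    return num / den

# Compare against dense attention on random data.
rng = np.random.default_rng(0)
d, n = 16, 512
q = rng.normal(size=d)
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
logits = K @ q
p = np.exp(logits - logits.max()); p /= p.sum()
print(np.abs(p @ V - skip_softmax_attention(q, K, V)).max())  # small
```

Each skipped position carries softmax weight at most about exp(-threshold) relative to the maximum, which is why a generous threshold costs almost no accuracy while saving the softmax and value work for those blocks.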

Top 5 AI Model Optimization Techniques for Faster, Smarter Inference

As AI models get larger and architectures more complex, researchers and engineers are continuously finding new techniques to optimize the performance and overall cost of bringing AI systems to production. Model optimization is a category of techniques focused on improving inference-serving efficiency. These techniques represent the best “bang for buck” opportunities to optimize cost…
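
As one concrete instance of this category, post-training quantization shrinks weights to 8-bit integers, cutting memory and bandwidth roughly 4x versus FP32 at a small accuracy cost. A minimal NumPy sketch of symmetric per-tensor INT8 quantization follows; production stacks use calibrated, often per-channel schemes with fused low-precision kernels, so treat this as a sketch of the principle only.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization of a weight matrix."""
    scale = np.abs(w).max() / 127.0   # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 256)).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
print("bytes: %d -> %d" % (w.nbytes, q.nbytes))  # 4x smaller
```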

Streamline Complex AI Inference on Kubernetes with NVIDIA Grove

Over the past few years, AI inference has evolved from single-model, single-pod deployments into complex, multicomponent systems. A model deployment may now consist of several distinct components: prefill, decode, vision encoders, key-value (KV) routers, and more. In addition, entire agentic pipelines are emerging, where multiple such model instances collaborate to perform reasoning, retrieval…
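
To make the coordination problem concrete, the sketch below models, in plain Python, what such a deployment has to express: several roles with different shapes that must be placed and scaled as one unit. The types and role names are hypothetical; this is not the Grove API or its Kubernetes resource schema.

```python
from dataclasses import dataclass

@dataclass
class Role:
    """One component of a multicomponent model deployment."""
    name: str
    replicas: int
    gpus_per_replica: int

@dataclass
class ModelDeployment:
    roles: list[Role]

    def total_gpus(self) -> int:
        # A gang scheduler must place the whole set together, so
        # capacity planning works on the aggregate, not per pod.
        return sum(r.replicas * r.gpus_per_replica for r in self.roles)

llm = ModelDeployment(roles=[
    Role("prefill",   replicas=2, gpus_per_replica=4),
    Role("decode",    replicas=4, gpus_per_replica=2),
    Role("kv-router", replicas=1, gpus_per_replica=0),
])
print(llm.total_gpus())  # 16 GPUs that must land as a unit
```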
