Delivering Massive Performance Leaps for Mixture of Experts Inference on NVIDIA Blackwell

8 January 2026 at 19:43

As AI models continue to get smarter, people can rely on them for an expanding set of tasks. This leads users, from consumers to enterprises, to interact with AI more frequently, meaning that more tokens need to be generated. To serve these tokens at the lowest possible cost, AI platforms need to deliver the best possible token throughput per watt. Through extreme co-design across GPUs, CPUs…

Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT-LLM

16 December 2025 at 21:00

For machine learning engineers deploying LLMs at scale, the equation is familiar and unforgiving: as context length increases, attention computation costs explode. Whether you're dealing with retrieval-augmented generation (RAG) pipelines, agentic AI workflows, or long-form content generation, the quadratic complexity of attention remains a primary bottleneck. This post explains a technique known as…
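
The teaser is truncated, but the post's title names the technique: Skip Softmax. As a rough sketch of the general idea only (assuming, as the name suggests, that key/value blocks whose attention scores fall far enough below the running maximum contribute negligibly after the softmax and can be skipped), here is a minimal single-query example in NumPy. The function name and the block_size and threshold parameters are illustrative assumptions, not TensorRT-LLM's actual API or kernel.

```python
import numpy as np

def skip_softmax_attention(q, k, v, block_size=64, threshold=10.0):
    """Single-query attention that skips key/value blocks whose scores
    sit more than `threshold` below the global maximum, where
    exp(score - max) is numerically negligible.

    Illustrative sketch only; `block_size` and `threshold` are
    hypothetical parameters, not TensorRT-LLM's interface.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    n = k.shape[0]
    starts = range(0, n, block_size)

    # Pass 1: blockwise scores and the global (row) maximum.
    block_scores = [(k[s:s + block_size] @ q) * scale for s in starts]
    row_max = max(s.max() for s in block_scores)

    # Pass 2: accumulate softmax numerator/denominator, skipping blocks
    # whose every weight would be below exp(-threshold).
    num, den = np.zeros(d), 0.0
    for scores, start in zip(block_scores, starts):
        if scores.max() < row_max - threshold:
            continue  # negligible contribution: skip exp and V accumulation
        w = np.exp(scores - row_max)
        num += w @ v[start:start + block_size]
        den += w.sum()
    return num / den

# Hypothetical usage on a 4K-token context.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((4096, 64))
v = rng.standard_normal((4096, 64))
out = skip_softmax_attention(q, k, v)
```

Every skipped weight satisfies exp(score - row_max) < e^-threshold, so the approximation error is bounded, while the exponentials and weighted-value accumulation for those blocks are avoided; a production kernel would fuse this decision into the attention loop rather than making two passes over precomputed scores.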

