Abstract

Inference optimization is often discussed as one big problem, which is why teams end up tuning the wrong layer and wondering why latency or cost barely moved. This CNCG Colombo session splits the work into three layers and shows what each one actually controls.

1. Model layer. Quantization, distillation, speculative decoding, and the trade-offs against accuracy.
2. Runtime layer. Batching strategies, KV cache management, paged attention, and how serving engines like vLLM and TGI change the picture.
3. Infrastructure layer. GPU sharing, autoscaling on the right signal, tenant isolation, and the scheduling decisions that determine whether a node is full or wasted.

For each layer, the session covers what to measure, what to change first, and where diminishing returns kick in. Attendees leave with a mental model for diagnosing which layer is the bottleneck before reaching for the next optimization.

Breaking Down Inference Optimization: The Three Different Layers
