Breaking Down Inference Optimization: The Three Different Layers
CNCG Colombo
Abstract
Inference optimization gets discussed as one big problem, which is why teams end up tuning the wrong layer and wondering why latency or cost barely moved. This CNCG Colombo session splits the work into three layers and shows what each one actually controls.

1. Model layer. Quantization, distillation, speculative decoding, and the trade-offs against accuracy.
2. Runtime layer. Batching strategies, KV cache management, paged attention, and how serving engines like vLLM and TGI change the picture.
3. Infrastructure layer. GPU sharing, autoscaling on the right signal, tenant isolation, and the scheduling decisions that determine whether a node is full or wasted.

For each layer: what to measure, what to change first, and where the diminishing returns kick in. Attendees leave with a mental model for diagnosing which layer is the bottleneck before reaching for the next optimization.
More Talks
- Meetup
Stop the GPU Madness! Making LLM Inference Actually Efficient on K8s
AWS User Group Jaipur · Jaipur, India
- Conference
Help! My LLM is a Resource Hog: How We Tamed Inference with Kubernetes and Open Source Muscle
KubeCon + CloudNativeCon North America 2025 · Atlanta, USA
- Conference
Conformance for Inference: How We Reduced Bad Deploys on a GPU Platform
KubeCon + CloudNativeCon Japan 2026 · Tokyo, Japan
- Conference
Phippy's First Steps into Kubernetes
KubeCon + CloudNativeCon India 2026 · Mumbai, India