Help! My LLM is a Resource Hog: How We Tamed Inference with Kubernetes and Open Source Muscle
KubeCon + CloudNativeCon North America 2025
Abstract
LLM inference is the new resource hog. GPUs sit underutilised, model loading dominates cold-start time, and teams ship workloads that look fine in isolation but fall over the moment another tenant lands on the same node. This KubeCon NA 2025 session walks through how we tamed inference on Kubernetes using open source primitives — sensible scheduling, GPU sharing strategies, and tenant isolation patterns that prevent one model from starving another. Expect a tour of the trade-offs between vertical scaling, multi-instance GPUs, and tenant clusters, with examples drawn from production deployments.
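The abstract names GPU sharing via multi-instance GPUs (MIG) and per-tenant isolation as the core patterns. As a rough illustration of those ideas rather than material from the talk itself, the sketch below shows a namespace ResourceQuota capping a tenant's GPU requests and a Pod asking for a single MIG slice instead of a whole device. The namespace, names, image, and the nvidia.com/mig-1g.5gb resource name are assumptions; the exact resource name depends on the GPU model and on how the NVIDIA device plugin's MIG strategy is configured.

```yaml
# Hypothetical manifests illustrating the GPU-sharing and tenant-isolation
# patterns mentioned in the abstract. Resource names assume the NVIDIA device
# plugin exposing MIG slices as extended resources; adjust for your cluster.
---
# Per-tenant quota: cap how many MIG slices the tenant-a namespace can claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota                 # hypothetical name
  namespace: tenant-a             # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/mig-1g.5gb: "4"   # quota on an extended resource uses the requests. prefix
---
# Inference Pod requesting one MIG slice instead of a full GPU.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
  namespace: tenant-a
spec:
  containers:
    - name: server
      image: ghcr.io/example/llm-server:latest   # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # profile name varies with GPU model and plugin config
```

The quota is what keeps one tenant's models from starving another's: once the namespace has claimed its share of slices, admission simply rejects further Pods rather than letting them contend for the same GPU.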
More Talks
- Stop the GPU Madness! Making LLM Inference Actually Efficient on K8s (Meetup) · AWS User Group Jaipur · Jaipur, India
- Phippy's First Steps into Kubernetes (Conference) · KubeCon + CloudNativeCon India 2026 · Mumbai, India
- GitOps Your Costs: Automated FinOps Through Argo Workflows (Conference) · ArgoCon 2026 · Amsterdam, Netherlands
- Helm for Beginners (Conference) · Kubernetes Community Days Africa 2022 (Online) · Lagos, Nigeria