Conformance for Inference: How We Reduced Bad Deploys on a GPU Platform
KubeCon + CloudNativeCon Japan 2026
Abstract
Inference on GPUs fails in recurring ways: a wrong image or artifact, mismatched CUDA or runtime versions, undersized GPU memory, bad resource requests, or a model that passes offline checks but regresses under real traffic. On a shared Kubernetes GPU platform, those mistakes become multi-tenant incidents — noisy neighbors, OOMKills, SLO breaches, and rollbacks that waste accelerator time. This talk describes how one team built conformance for inference workloads: checks applied before production traffic, covering container and model artifacts, GPU capacity and visibility contracts, health and readiness semantics, and minimum observability through metrics and, where used, traces. Scheduled for Thursday, 30 July 2026, 2:10 PM–2:40 PM Japan Standard Time in Level 4, rooms 414+415. Attendees leave with a practical checklist they can reuse: how to separate build checks from serving conformance, how to catch regressions early, and how to align GPU scheduling and quotas with inference SLOs. The session also shares what worked, what did not, what teams pushed back on, and a short checklist for platform and application owners.
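The pre-traffic checks the abstract describes can be sketched as a simple validator over a serving pod spec. This is a minimal illustration only: the field names, check names, and thresholds below are assumptions for the sketch, not the team's actual tooling.

```python
# Illustrative pre-traffic conformance checks for an inference workload.
# All rule names and spec fields here are assumptions, not the talk's tooling.

def check_inference_conformance(pod_spec: dict) -> list:
    """Return a list of conformance violations for a serving pod spec."""
    violations = []
    container = pod_spec["containers"][0]

    # Artifact check: image must be pinned by digest, not a mutable tag.
    if "@sha256:" not in container["image"]:
        violations.append("image not pinned by digest")

    # GPU capacity contract: an explicit GPU limit must be declared.
    limits = container.get("resources", {}).get("limits", {})
    if "nvidia.com/gpu" not in limits:
        violations.append("no GPU limit declared")

    # Health semantics: a readiness probe is required before traffic is admitted.
    if "readinessProbe" not in container:
        violations.append("missing readiness probe")

    # Minimum observability: a named metrics port must be exposed.
    ports = {p.get("name") for p in container.get("ports", [])}
    if "metrics" not in ports:
        violations.append("no metrics port exposed")

    return violations


# A spec that pins a GPU limit but uses a mutable tag, has no readiness
# probe, and exposes no metrics port fails three of the four checks.
spec = {
    "containers": [{
        "image": "registry.example.com/serve:latest",
        "resources": {"limits": {"nvidia.com/gpu": "1"}},
        "ports": [{"name": "http", "containerPort": 8080}],
    }]
}
print(check_inference_conformance(spec))
```

In practice a validator like this would run as an admission webhook or a CI gate, so violations block a deploy before it ever receives production traffic.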
More Talks
- Help! My LLM is a Resource Hog: How We Tamed Inference with Kubernetes and Open Source Muscle
  KubeCon + CloudNativeCon North America 2025 · Atlanta, USA
- Phippy's First Steps into Kubernetes
  KubeCon + CloudNativeCon India 2026 · Mumbai, India
- GitOps Your Costs: Automated FinOps Through Argo Workflows
  ArgoCon 2026 · Amsterdam, Netherlands
- Helm for Beginners
  Kubernetes Community Days Africa 2022 (Online) · Lagos, Nigeria