Tokyo, Japan

Conformance for Inference: How We Reduced Bad Deploys on a GPU Platform

KubeCon + CloudNativeCon Japan 2026

Kubernetes · GPU · Inference · Conformance · CNCF

Abstract

Inference on GPUs fails in repetitive ways: a wrong image or artifact, a mismatched CUDA or runtime version, undersized GPU memory, bad resource requests, or a model that passes offline checks but regresses under real traffic. On a shared Kubernetes GPU platform, those mistakes become multi-tenant incidents: noisy neighbors, OOMKills, SLO breaches, and rollbacks that waste accelerator time. This talk describes how one team built conformance for inference workloads, a set of checks applied before production traffic that cover container and model artifacts, GPU capacity and visibility contracts, health and readiness semantics, and minimum observability through metrics and, where used, traces. Attendees leave with a practical checklist for platform and application owners: how to separate build conformance from serving conformance, how to catch regressions early, and how to align GPU scheduling and quotas with inference SLOs. The session also shares what worked, what did not, and what teams pushed back on.

Thursday, 30 July 2026 · 2:10 PM–2:40 PM Japan Standard Time (JST) · Level 4, rooms 414+415
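To make the kind of pre-traffic checks described above concrete, here is a minimal sketch, not the team's actual tooling: a function that scans a Kubernetes Pod spec (represented as a plain dict) for three of the conformance rules the abstract names: an artifact pinned by digest, an explicit GPU request, and a readiness probe. All names and rule details are illustrative assumptions.

```python
# Hypothetical serving-conformance checker; rule set and names are
# illustrative, not the talk's actual implementation.
def conformance_findings(pod_spec: dict) -> list[str]:
    """Return human-readable findings; an empty list means conformant."""
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        # Artifact contract: images pinned by digest, so the deployed
        # artifact is exactly what was validated, not a moving tag.
        if "@sha256:" not in container.get("image", ""):
            findings.append(f"{name}: image is not pinned by digest")
        # GPU capacity contract: GPUs must be requested explicitly;
        # no workload should assume implicit access to an accelerator.
        requests = container.get("resources", {}).get("requests", {})
        if "nvidia.com/gpu" not in requests:
            findings.append(f"{name}: missing explicit GPU request")
        # Readiness semantics: traffic must be gated on model load,
        # not merely on the serving process having started.
        if "readinessProbe" not in container:
            findings.append(f"{name}: no readiness probe defined")
    return findings
```

A spec that pins its image, requests one `nvidia.com/gpu`, and defines a readiness probe yields an empty findings list; each omission adds one finding, which a CI gate could surface before the workload ever receives production traffic.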
