Squeezing Every Millisecond: A Practical Guide to Optimizing Time To First Token with OSS Muscle
Open Source Summit Korea 2026
Abstract
GPUs get faster every year, yet users of large language models still notice the pause before the first word appears. That pause has a name, Time To First Token (TTFT), and in production LLM systems, shaving even a few hundred milliseconds from it can noticeably change how responsive an application feels. This talk tells the story of where those milliseconds go. We will walk through the lifecycle of a request in modern LLM serving systems and explore the practical techniques engineers use to reduce TTFT in real deployments. Using examples from open source stacks like vLLM, TensorRT-LLM, and Hugging Face TGI, we will examine four optimization levers: KV cache strategies, speculative decoding, model quantization, and batching policies. Rather than focusing only on theory, the session highlights the tradeoffs practitioners face. When does speculative decoding actually help? When does batching hurt latency? When does quantization reduce memory pressure enough to speed up the first token? Attendees will leave with a practical playbook for diagnosing TTFT bottlenecks and choosing the right optimization strategy for their model, infrastructure, and workload.
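Diagnosing TTFT starts with measuring it. The sketch below is illustrative only and is not taken from the talk: it times any iterable of streamed tokens and reports the delay before the first token alongside the total generation time. The names `measure_ttft` and `fake_stream` are hypothetical, and `fake_stream` only simulates a server with sleeps; in a real deployment the iterable would be the streaming response from whichever serving stack is in use.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, token_count) for a token stream.

    The clock starts just before iteration begins, so the stream should be
    lazy (for example a generator that issues the request on first read).
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            # First token observed: this gap is the TTFT.
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at - start, end - start, count


def fake_stream(prefill_s: float, per_token_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model server: one long pause, then tokens."""
    time.sleep(prefill_s)  # simulated queueing + prefill before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft * 1000:.1f} ms, total: {total * 1000:.1f} ms, tokens: {n}")
```

Separating TTFT from total time makes it easier to tell whether a change affected the wait before the first token or the decode rate afterwards.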
More Talks
- Conference
Help! My LLM is a Resource Hog: How We Tamed Inference with Kubernetes and Open Source Muscle
KubeCon + CloudNativeCon North America 2025 · Atlanta, USA
- Conference
Conformance for Inference: How We Reduced Bad Deploys on a GPU Platform
KubeCon + CloudNativeCon Japan 2026 · Tokyo, Japan
- Conference
Open Source is Not the Same Anymore
Open Source India 2026 · Mumbai, India
- Conference
Accelerating CI Pipelines: Rapid Kubernetes Testing with vCluster
FOSDEM 2025 · Brussels, Belgium