Abstract

Large language models run on faster GPUs every year, yet users still notice the pause before the first word appears. That pause has a name: Time To First Token (TTFT). In production LLM systems, shaving even a few hundred milliseconds from it can dramatically change how responsive an application feels. This talk tells the story of where those milliseconds go. We will walk through the lifecycle of a request in modern LLM serving systems and explore the practical techniques engineers use to reduce TTFT in real deployments. Using examples from open source stacks like vLLM, TensorRT-LLM, and Hugging Face TGI, we will examine four powerful optimization levers: KV cache strategies, speculative decoding, model quantization, and batching policies. Rather than focusing only on theory, the session highlights the tradeoffs practitioners face. When does speculative decoding actually help? When does batching hurt latency? When does quantization reduce memory pressure enough to speed up the first token? Attendees will leave with a practical playbook for diagnosing TTFT bottlenecks and choosing the right optimization strategy for their model, infrastructure, and workload.

Squeezing Every Millisecond: A Practical Guide to Optimizing Time To First Token with OSS Muscle

More Talks

Help! My LLM is a Resource Hog: How We Tamed Inference with Kubernetes and Open Source Muscle

Conformance for Inference: How We Reduced Bad Deploys on a GPU Platform

Open Source is Not the Same Anymore

Accelerating CI Pipelines: Rapid Kubernetes Testing with vCluster