TTFT — time to first token — is the latency users actually feel. It's the wall-clock interval from "I hit enter" to "the first token shows up." Almost every recent jump in LLM serving performance, from vLLM to TensorRT-LLM to chunked prefill, is in some way an attack on TTFT. So it's worth knowing how to estimate it with a pencil before you reach for a profiler.
This post draws on the framing in apxml.com's TTFT estimator and fills in the math I keep on the back of an envelope.
What counts as TTFT
Strictly: the time from the request hitting your serving endpoint to the first token byte being written to the wire. It decomposes into five terms:

$$T_{\text{TTFT}} = T_{\text{tokenize}} + T_{\text{queue}} + T_{\text{prefill}} + T_{\text{decode}_1} + T_{\text{flush}}$$

In production, two of these dominate everything else: T_queue (how long your request waited in the scheduler) and T_prefill (the forward pass over your prompt). The other terms (tokenization, the first decode forward, and the bytes flushing to the client) contribute a few milliseconds each.
Prefill is the term that matters
Once your request is at the head of the queue, the GPU has to chew through the entire prompt in one forward pass. This builds the KV-cache that every later decode step attends over. That single forward pass is the prefill, and it's where 70–95% of TTFT lives for any non-trivial prompt.
Prefill is special compared to decode for one reason: it's a sequence of length N, not 1. Every weight in the model is loaded once from HBM and reused across all N positions. That changes the bottleneck dramatically.
Counting the FLOPs
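The forward pass through a dense transformer does two FLOPs per non-embedding parameter per token (one multiply, one add). So for P parameters and N prompt tokens, the linear (weight-matmul) part of a forward pass is:

$$\text{FLOPs}_{\text{linear}} \approx 2 \cdot P \cdot N$$

Self-attention adds a quadratic term, because it computes an N×N score matrix per attention head per layer (once for QKᵀ, once for the weighted sum over V):

$$\text{FLOPs}_{\text{attn}} \approx 4 \cdot L \cdot h \cdot d \cdot N^2$$

where L is layers, h is the number of attention (query) heads, and d is the head dimension. GQA shrinks the K and V projections and the KV-cache, but every query head still computes its own scores, so h stays the query-head count here. For short prompts this term is dust; at long context it can outgrow the linear part. More on that in a minute.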
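As a quick check on the two terms described above, here is a minimal Python sketch of the FLOP count. The function name `prefill_flops` is mine, not from any library, and it assumes the approximations used in this post: 2 FLOPs per parameter per token, and full N×N attention with no discount for causal masking.

```python
def prefill_flops(P: float, N: int, L: int, h: int, d: int) -> tuple[float, float]:
    """Return (linear_flops, attention_flops) for one prefill pass.

    P: non-embedding parameters, N: prompt tokens,
    L: layers, h: query heads, d: head dimension.
    """
    # Weight matmuls: one multiply and one add per parameter per token.
    linear = 2 * P * N
    # Attention: QK^T and scores @ V each cost 2 * N * N * d per head per layer.
    attention = 4 * L * h * d * N * N
    return linear, attention


# Llama-3-8B-like shape with a 2048-token prompt (illustrative values).
linear, attention = prefill_flops(P=8.0e9, N=2048, L=32, h=32, d=128)
print(f"linear:    {linear:.3e} FLOPs")
print(f"attention: {attention:.3e} FLOPs ({attention / linear:.1%} of linear)")
```

At this prompt length the attention term is a single-digit percentage of the linear term, which is why it is often dropped for short prompts.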
Roofline: compute or memory?
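You can't just divide FLOPs by peak TFLOPS, because the GPU may be waiting on HBM instead. The roofline model says you're bounded by the worse of two terms:

$$T \ge \max\left(\frac{\text{FLOPs}}{\text{peak FLOPS}},\ \frac{\text{bytes moved}}{\text{HBM bandwidth}}\right)$$

The crossover is set by arithmetic intensity (AI): FLOPs per byte loaded. For a transformer weight matmul the weights occupy P·sizeof(dtype) bytes, and you do roughly 2·N FLOPs per weight (each weight takes part in one multiply and one add for each of the N tokens). So your AI for prefill is:

$$\text{AI}_{\text{prefill}} \approx \frac{2 \cdot P \cdot N}{P \cdot \text{sizeof(dtype)}} = \frac{2N}{\text{sizeof(dtype)}}$$

which is N FLOPs/byte for BF16. Compare to the GPU's ridge point (peak FLOPS ÷ peak HBM bandwidth):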
| GPU | Peak BF16 TFLOPS | HBM BW | Ridge point |
|---|---|---|---|
| A100 80GB | 312 | 2.0 TB/s | ~156 FLOPs/byte |
| H100 80GB | 989 | 3.35 TB/s | ~295 FLOPs/byte |
| H200 141GB | 989 | 4.8 TB/s | ~206 FLOPs/byte |
| MI300X | 1307 | 5.3 TB/s | ~247 FLOPs/byte |
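To see which side of the roofline a given prompt lands on, a small sketch can compare prefill arithmetic intensity against the ridge points in the table. The helper names are hypothetical and the peak figures are simply the table's values. It assumes weight traffic dominates the bytes moved, which ignores activations and the KV-cache.

```python
# Peak dense BF16 TFLOPS and HBM bandwidth in TB/s, copied from the table above.
GPUS = {
    "A100 80GB": (312.0, 2.0),
    "H100 80GB": (989.0, 3.35),
    "H200 141GB": (989.0, 4.8),
    "MI300X": (1307.0, 5.3),
}


def ridge_point(peak_tflops: float, hbm_tbs: float) -> float:
    """FLOPs per byte at which compute time equals memory time."""
    return peak_tflops / hbm_tbs  # the 1e12 factors cancel


def prefill_intensity(n_tokens: int, dtype_bytes: float = 2.0) -> float:
    """Approximate FLOPs per weight byte: 2 FLOPs per weight per token."""
    return 2.0 * n_tokens / dtype_bytes


def is_compute_bound(n_tokens: int, gpu: str, dtype_bytes: float = 2.0) -> bool:
    """True when the prefill intensity sits above the GPU's ridge point."""
    return prefill_intensity(n_tokens, dtype_bytes) > ridge_point(*GPUS[gpu])


for name, spec in GPUS.items():
    print(f"{name}: ridge ~{ridge_point(*spec):.0f} FLOPs/byte, "
          f"2048-token prefill compute-bound: {is_compute_bound(2048, name)}")
```

Under these assumptions a BF16 prefill is compute-bound once the prompt is longer than a few hundred tokens on every GPU listed, while a single-token decode step (intensity ≈ 1) sits far on the memory-bound side.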
A worked example: Llama-3-8B, 2k tokens, H100
Plug in real numbers.
| quantity | value |
|---|---|
| params P | 8.0 × 10⁹ |
| layers L | 32 |
| heads (Q / KV) | 32 / 8 (GQA) |
| head dim d | 128 |
| prompt length N | 2048 |
| dtype | BF16 |
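Linear FLOPs first:

$$2 \cdot P \cdot N = 2 \cdot (8.0 \times 10^{9}) \cdot 2048 \approx 3.28 \times 10^{13}$$

Attention FLOPs:

$$4 \cdot L \cdot h \cdot d \cdot N^2 = 4 \cdot 32 \cdot 32 \cdot 128 \cdot 2048^2 \approx 2.2 \times 10^{12}$$

With H100 BF16 peak 989 TFLOPS and a realistic MFU of 0.45 for prefill (typical for vLLM / TensorRT-LLM batched prefill):

$$T_{\text{prefill}} \approx \frac{3.28 \times 10^{13} + 2.2 \times 10^{12}}{989 \times 10^{12} \cdot 0.45} \approx 79\ \text{ms}$$

Add ~7 ms for the first decode forward (covered below), plus a couple of ms for tokenization and the streaming flush. If the request waited in the scheduler, say 20 ms under load, your TTFT lands near 108 ms. That's a plausible P50 for a healthy deployment.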
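The 0.45 MFU is the softest number in this example, so it helps to see how the prefill estimate moves with it. The sketch below recomputes the example for a few assumed MFU values. The helper name and the alternative MFU values are mine, chosen only for illustration.

```python
def prefill_ms(P: float, N: int, L: int, h: int, d: int,
               peak_tflops: float, mfu: float) -> float:
    """Compute-bound prefill time in milliseconds."""
    flops = 2 * P * N + 4 * L * h * d * N * N  # linear + attention
    return 1e3 * flops / (peak_tflops * 1e12 * mfu)


# Llama-3-8B, 2048-token prompt, H100 BF16 peak of 989 TFLOPS.
shape = dict(P=8.0e9, N=2048, L=32, h=32, d=128, peak_tflops=989.0)

for mfu in (0.30, 0.45, 0.60):  # 0.45 is the post's figure; the others are illustrative
    print(f"MFU {mfu:.2f}: prefill ~{prefill_ms(**shape, mfu=mfu):.1f} ms")
```

Because prefill time scales as 1/MFU, a deployment stuck at half the assumed utilisation doubles the largest term in the budget.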
When attention starts to matter
The linear term grows as N. The attention term grows as N². Where do they meet? Set 2·P·N = 4·L·h·d·N² and solve for N:

$$N^{*} = \frac{P}{2 \cdot L \cdot h \cdot d}$$

For Llama-3-8B that's 8e9 / (2·32·32·128) ≈ 30,500 tokens. So under ~30k context the linear term still dominates; past that, attention starts eating prefill time. This is exactly why FlashAttention 2/3 and chunked attention kernels matter more at long context: they don't change the FLOP count, but they cut HBM traffic inside attention by an order of magnitude.
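The crossover is easy to compute for any model shape. The following is a minimal sketch, with a function name of my own choosing, that also shows how the attention share of prefill FLOPs grows with prompt length under the same FLOP approximations.

```python
def attention_crossover_tokens(P: float, L: int, h: int, d: int) -> float:
    """Prompt length where 4*L*h*d*N^2 equals 2*P*N, i.e. N = P / (2*L*h*d)."""
    return P / (2 * L * h * d)


n_star = attention_crossover_tokens(P=8.0e9, L=32, h=32, d=128)
print(f"attention matches linear FLOPs at ~{n_star:,.0f} tokens")

# Share of prefill FLOPs spent in attention at a few prompt lengths.
for n in (2_048, 16_384, 131_072):
    share = n / (n + n_star)  # attn / (attn + linear) simplifies to N / (N + N*)
    print(f"N={n:>7,}: attention is {share:.0%} of prefill FLOPs")
```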
The first decode forward
After prefill, you do one more forward pass with N=1 to sample the first output token. This one is memory-bound: AI ≈ 1, far below the ridge point. The cost is reading the model weights from HBM one more time:

$$T_{\text{decode}_1} \approx \frac{P \cdot \text{sizeof(dtype)}}{\text{HBM bandwidth} \cdot \text{MBU}}$$

For Llama-3-8B on H100 with MBU ≈ 0.7: 16 GB / (3350 GB/s · 0.7) ≈ 6.8 ms. The KV-cache also has to be read, but for a 2k context with GQA it's only ~270 MB, which takes well under a millisecond.
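Both numbers in this section can be reproduced in a few lines. This is a sketch with hypothetical helper names. It assumes the weights are read exactly once and that the KV-cache stores one K and one V vector per KV head per layer per token.

```python
def first_decode_ms(P: float, dtype_bytes: float, hbm_gbs: float, mbu: float) -> float:
    """Time to stream the weights from HBM once, in milliseconds."""
    return 1e3 * P * dtype_bytes / (hbm_gbs * 1e9 * mbu)


def kv_cache_bytes(N: int, L: int, kv_heads: int, d: int, dtype_bytes: float) -> float:
    """K and V tensors for N tokens across all layers."""
    return 2 * L * kv_heads * d * N * dtype_bytes


# Llama-3-8B in BF16 on an H100 (3350 GB/s), MBU 0.7, 2048-token context.
t_weights = first_decode_ms(P=8.0e9, dtype_bytes=2, hbm_gbs=3350.0, mbu=0.7)
kv = kv_cache_bytes(N=2048, L=32, kv_heads=8, d=128, dtype_bytes=2)
t_kv = 1e3 * kv / (3350.0e9 * 0.7)

print(f"weights read: ~{t_weights:.1f} ms")
print(f"KV-cache: ~{kv / 1e6:.0f} MB, read in ~{t_kv:.2f} ms")
```

Note that `kv_heads` is the GQA KV-head count (8 here), not the 32 query heads, which is what keeps the cache small.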
Levers that actually move TTFT
Things that help
- Prefix caching. If a prompt shares a prefix with a recent one (system prompt, few-shot examples), reuse its KV-cache. This skips the most expensive part of prefill entirely. Easy 3–10× win on system-prompt-heavy workloads.
- FP8 / INT8 weights. Halves the bytes you load and roughly doubles your compute throughput on H100 / MI300X. Direct ~1.6–1.9× prefill speedup with FP8.
- FlashAttention 2/3. Eliminates the materialisation of the N×N score matrix in HBM. Marginal at short context, huge past 16k.
- Paged attention + continuous batching. Doesn't speed up your prefill, but cuts T_queue by letting more requests share the GPU efficiently. P99 TTFT lives or dies here.
- Tensor / pipeline parallelism. More FLOPS at the cost of inter-GPU comms. Below ~1k tokens, comms overhead can increase TTFT.
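To put a rough number on the prefix-caching lever, the sketch below estimates prefill FLOPs when the first `cached` tokens already have KV entries. This is my own simplification, not how any particular server accounts for it: only the uncached tokens go through the weight matmuls, and their queries still attend over the full prompt.

```python
def prefill_flops_with_prefix_cache(P: float, N: int, cached: int,
                                    L: int, h: int, d: int) -> float:
    """Prefill FLOPs when the first `cached` of N prompt tokens hit the KV-cache."""
    if not 0 <= cached <= N:
        raise ValueError("cached must be between 0 and N")
    new = N - cached
    linear = 2 * P * new                 # weights only touch the uncached tokens
    attention = 4 * L * h * d * new * N  # new queries attend over all N keys
    return linear + attention


shape = dict(P=8.0e9, L=32, h=32, d=128)
cold = prefill_flops_with_prefix_cache(N=2048, cached=0, **shape)
# Hypothetical workload: a 1536-token shared system prompt plus 512 new tokens.
warm = prefill_flops_with_prefix_cache(N=2048, cached=1536, **shape)
print(f"prefill FLOPs reduced ~{cold / warm:.1f}x")
```

With this model the saving is simply N / (N − cached), which lands inside the 3–10× range quoted above when most of the prompt is shared.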
Things that don't help TTFT (even if they help other things)
- Speculative decoding. Helps ITL (inter-token latency) — i.e. tokens after the first — but TTFT is the same. You still pay one full prefill.
- Bigger batches. Improves throughput, often hurts P99 TTFT under load.
- Sampling tricks (top-k, top-p). The cost of sampling is negligible; you can't optimise here.
Estimation checklist
- Compute 2·P·N linear FLOPs.
- If N > ~16k, add the attention term.
- Divide by peak_TFLOPS · 0.45 for prefill time.
- Add P·sizeof(dtype) / (HBM_BW · 0.7) for first decode.
- Add 10–30 ms for queue + scheduler + streaming flush.
If your measured TTFT is 2–3× higher than what this gives you, the problem is almost never the math — it's queueing, tokenizer overhead you forgot about, or an MFU on the floor because your kernels aren't fused. Start there.
    def estimate_ttft(
        P: float,                 # params (non-embedding)
        N: int,                   # prompt tokens
        L: int, h: int, d: int,   # layers, query heads, head dim
        dtype_bytes: float,       # 2 for BF16, 1 for FP8
        peak_tflops: float,       # GPU peak in TFLOPS (dense BF16 or FP8 equivalent)
        hbm_bw_gbs: float,        # HBM bandwidth in GB/s
        mfu: float = 0.45,        # model FLOPs utilisation
        mbu: float = 0.70,        # memory bandwidth utilisation
        queue_ms: float = 15.0,   # scheduler + tokenize + flush
    ) -> float:
        """Back-of-envelope TTFT in milliseconds."""
        # Prefill is compute-bound: total FLOPs over achieved FLOPS.
        linear_flops = 2 * P * N
        attn_flops = 4 * L * h * d * N * N
        t_prefill_ms = 1e3 * (linear_flops + attn_flops) / (peak_tflops * 1e12 * mfu)
        # First decode is memory-bound: one full read of the weights from HBM.
        weight_bytes = P * dtype_bytes
        t_decode1_ms = 1e3 * weight_bytes / (hbm_bw_gbs * 1e9 * mbu)
        return queue_ms + t_prefill_ms + t_decode1_ms

Run it for your model + GPU before you spin up a load test. If the numbers say 80 ms but you're getting 600 ms, the bottleneck isn't where you think it is.