Suraj Sharan · Applied AI Engineer

TTFT — time to first token — is the latency users actually feel. It's the wall-clock interval from "I hit enter" to "the first token shows up." Almost every recent jump in LLM serving performance, from vLLM to TensorRT-LLM to chunked prefill, is in some way an attack on TTFT. So it's worth knowing how to estimate it with a pencil before you reach for a profiler.

This post draws on the framing in apxml.com's TTFT estimator and fills in the math I keep on the back of an envelope.

What counts as TTFT

Strictly: the time from the request hitting your serving endpoint to the first token byte being written to the wire.

TTFT ≈ T_queue + T_tok + T_prefill + T_decode₁ + T_stream

(components of TTFT)

In production, two of these dominate everything else: T_queue (how long your request waited in the scheduler) and T_prefill (the forward pass over your prompt). The other terms (tokenization, the first decode forward, and the bytes flushing to the client) usually add up to about ten milliseconds or less for a model that fits on one GPU.

note

Don't conflate TTFT with throughput. Throughput is your steady-state tokens/sec across all in-flight requests. TTFT is the P50/P95/P99 of per-request first-byte latency. The two optimisations are different — sometimes opposite.
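To make the distinction concrete, here's a minimal sketch of summarising per-request TTFT as percentiles. The sample latencies are made up for illustration, and `percentile` is a hypothetical nearest-rank helper, not something from a serving stack.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request TTFT samples in milliseconds (illustrative only).
ttft_ms = [82, 85, 88, 90, 91, 95, 97, 110, 240, 610]

summary = {f"p{q}": percentile(ttft_ms, q) for q in (50, 95, 99)}
print(summary)  # {'p50': 91, 'p95': 610, 'p99': 610}
```

Two slow requests barely move the median but own the tail, which is why P95/P99 are tracked separately from throughput.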

Prefill is the term that matters

Once your request is at the head of the queue, the GPU has to chew through the entire prompt in one forward pass. This builds the KV-cache that every subsequent decode step attends over. That single forward pass is the prefill, and it's where 70–95% of TTFT lives for any non-trivial prompt.

Prefill is special compared to decode for one reason: it's a sequence of length N, not 1. Every weight in the model is loaded once from HBM and reused across all N positions. That changes the bottleneck dramatically.

Counting the FLOPs

The forward pass through a dense transformer touches each non-embedding parameter twice per token (one multiply, one add). So the linear (weight-matmul) part of a forward pass is:

FLOPs_linear ≈ 2 · P · N (P = params, N = tokens)

(linear FLOPs)

Self-attention adds a quadratic term — it computes an N×N score matrix per attention head per layer:

FLOPs_attn ≈ 4 · L · h · d · N²

(attention FLOPs)

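where L is layers, h is the number of query heads, and d is the head dimension. GQA shrinks the K and V projections and the KV-cache, but every query head still scores against all N keys, so h here stays the query-head count. The factor of 4 covers the two matmuls per head (QKᵀ and scores·V) at 2 FLOPs per multiply-add, ignoring the causal mask. For short prompts this term is dust; at long context it can outgrow the linear part. More on that in a minute.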
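A minimal sketch of the two FLOP counts above, using the same symbols. The function names are mine, the model shape in the loop is arbitrary, and the attention count assumes full (unmasked) N×N scores with h taken as the number of query heads.

```python
def linear_flops(params: float, n_tokens: int) -> float:
    """Weight-matmul FLOPs: 2 per parameter per token."""
    return 2.0 * params * n_tokens


def attention_flops(layers: int, heads: int, head_dim: int, n_tokens: int) -> float:
    """Score (QK^T) plus value (scores @ V) matmuls: 4 * L * h * d * N^2."""
    return 4.0 * layers * heads * head_dim * n_tokens ** 2


# Doubling the prompt doubles the linear term but quadruples the attention term.
# The shape below is arbitrary and only serves the comparison.
for n in (1024, 2048, 4096):
    lin = linear_flops(1e9, n)
    att = attention_flops(layers=16, heads=16, head_dim=64, n_tokens=n)
    print(f"N={n:5d}  linear={lin:.2e}  attention={att:.2e}  attn/linear={att / lin:.3f}")
```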

Roofline: compute or memory?

You can't just divide FLOPs by peak TFLOPS — the GPU may be waiting on HBM instead. The roofline model says you're bounded by the worse of two terms:

T ≈ max( FLOPs / (FLOPS · MFU), bytes / (BW · MBU) )

(roofline)
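As a rough sketch, the roofline bound is just a `max` over two divisions. The helper name, the 8B-parameter BF16 model, and the MFU/MBU values of 0.45 and 0.7 are assumptions for illustration; the H100 peaks are the commonly quoted 989 TFLOPS and 3.35 TB/s.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, peak_bw, mfu=1.0, mbu=1.0):
    """Return (seconds, limiting resource) under the roofline model."""
    compute_s = flops / (peak_flops * mfu)
    memory_s = bytes_moved / (peak_bw * mbu)
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound


# Assumed example: weight matmuls of an 8e9-parameter BF16 model on an H100.
PARAMS, DTYPE_BYTES = 8e9, 2
H100_FLOPS, H100_BW = 989e12, 3.35e12  # FLOP/s, bytes/s

for n_tokens in (64, 2048):
    t, bound = roofline_time_s(
        flops=2 * PARAMS * n_tokens,
        bytes_moved=PARAMS * DTYPE_BYTES,  # weights are read once per pass
        peak_flops=H100_FLOPS, peak_bw=H100_BW, mfu=0.45, mbu=0.7,
    )
    print(f"N={n_tokens}: {t * 1e3:.1f} ms, {bound}-bound")
```

The short prompt costs the same as a bare weight read, while the long one is set entirely by FLOPs.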

The crossover is set by arithmetic intensity (AI): FLOPs per byte loaded. For a transformer weight matmul the bytes moved are the weights. Each weight does 2·N FLOPs (one multiply-add per token, across N tokens) and a BF16 weight occupies 2 bytes, so that's roughly N FLOPs per byte. So your AI for prefill is:

AI_prefill ≈ N (BF16, weights only)

(prefill arithmetic intensity)

Compare to the GPU's ridge point (peak FLOPS ÷ peak HBM bandwidth):

GPU	Peak BF16 TFLOPS	HBM BW	Ridge point
A100 80GB	312	2.0 TB/s	~156 FLOPs/byte
H100 80GB	989	3.35 TB/s	~295 FLOPs/byte
H200 141GB	989	4.8 TB/s	~206 FLOPs/byte
MI300X	1307	5.3 TB/s	~247 FLOPs/byte

ridge point ≈ ops/byte you need to be compute-bound (BF16 dense)

rule of thumb

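On an H100, prefill becomes compute-bound around N ≳ 300 tokens. Below that, you're bandwidth-bound and TTFT tracks HBM, not FLOPS. Batching helps short prompts: the weights are loaded once for the whole batch, so the effective N is the total token count across it.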
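Here's a sketch of the ridge-point comparison using the table's peak numbers. It ignores MFU and MBU, counts only weight traffic in BF16, and the function names are illustrative.

```python
# (peak dense BF16 TFLOPS, HBM bandwidth in TB/s), copied from the table above
GPUS = {
    "A100 80GB": (312, 2.0),
    "H100 80GB": (989, 3.35),
    "H200 141GB": (989, 4.8),
    "MI300X": (1307, 5.3),
}


def ridge_point(peak_tflops: float, hbm_tbs: float) -> float:
    """FLOPs per byte needed to saturate compute (TFLOPS / TB/s = FLOPs/byte)."""
    return peak_tflops / hbm_tbs


def prefill_intensity(n_tokens: int, dtype_bytes: float = 2) -> float:
    """2*N FLOPs per weight divided by bytes per weight; equals N for BF16."""
    return 2 * n_tokens / dtype_bytes


def is_compute_bound(n_tokens: int, gpu: str) -> bool:
    """Compare BF16 prefill intensity against the GPU's ridge point."""
    return prefill_intensity(n_tokens) >= ridge_point(*GPUS[gpu])


for name, spec in GPUS.items():
    print(f"{name}: ridge ~{ridge_point(*spec):.0f} FLOPs/byte")
print(is_compute_bound(128, "H100 80GB"), is_compute_bound(512, "H100 80GB"))
```

Real kernels fall short of both peaks, so treat the threshold as a rough guide rather than a hard switch.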

A worked example: Llama-3-8B, 2k tokens, H100

Plug in real numbers.

quantity	value
params P	8.0 × 10⁹
layers L	32
heads (Q / KV)	32 / 8 (GQA)
head dim d	128
prompt length N	2048
dtype	BF16

Llama-3-8B sketch

Linear FLOPs first:

2 · 8 × 10⁹ · 2048 = 32.8 TFLOPs

Attention FLOPs:

4 · 32 · 32 · 128 · 2048² ≈ 2.2 TFLOPs (about 6.3% of total)

With H100 BF16 peak 989 TFLOPS and a realistic MFU of 0.45 for prefill (typical for vLLM / TensorRT-LLM batched prefill):

T_prefill ≈ (32.8 + 2.2) / (989 · 0.45) ≈ 79 ms

(prefill, compute-bound)

Add ~7 ms for the first decode forward (derived in "The first decode forward" below), a couple of ms for tokenization, and the streaming flush. If the request waited in the scheduler, say 20 ms under load, your TTFT lands near 110 ms. That's a plausible P50 for a healthy deployment.
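The prefill arithmetic above fits in a few lines. This is only a sketch that replays the numbers from the table; the variable names are mine and MFU = 0.45 is the same assumed value as in the text.

```python
# Llama-3-8B sketch values from the table above
P, L, H, D, N = 8.0e9, 32, 32, 128, 2048
PEAK_FLOPS = 989e12  # H100 dense BF16, FLOP/s
MFU = 0.45           # assumed prefill utilisation

linear = 2 * P * N              # weight matmuls
attn = 4 * L * H * D * N**2     # QK^T and scores @ V
attn_share = attn / (linear + attn)
t_prefill_ms = 1e3 * (linear + attn) / (PEAK_FLOPS * MFU)

print(f"linear    {linear / 1e12:.1f} TFLOPs")
print(f"attention {attn / 1e12:.1f} TFLOPs ({attn_share:.1%} of total)")
print(f"prefill   {t_prefill_ms:.0f} ms")
```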

watch out

MFU is a vibes number, not a constant. Prefill MFU varies 0.3–0.6 depending on tensor parallelism, attention impl (FlashAttention helps), and how well the kernel scheduler fuses ops. Always sanity check against your actual stack.

When attention starts to matter

The linear term grows as N. The attention term grows as N². Where do they meet?

2 · P · N = 4 · L · h · d · N² ⟹ N_× ≈ P / (2 · L · h · d)

(crossover length)

For Llama-3-8B that's 8e9 / (2·32·32·128) ≈ 30,500 tokens. So under ~30k context the linear term still dominates; past that, attention starts eating prefill time. This is exactly why FlashAttention 2/3 and chunked attention kernels matter more at long context — they don't change the FLOP count, but they cut HBM traffic inside attention by an order of magnitude.
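A quick sketch of the crossover formula, plus a helper for how large attention is relative to the linear term at a given prompt length. Function names are illustrative, and the attention count is the same unmasked 4·L·h·d·N² used earlier.

```python
def crossover_tokens(params: float, layers: int, heads: int, head_dim: int) -> float:
    """Prompt length where attention FLOPs equal linear FLOPs: P / (2*L*h*d)."""
    return params / (2 * layers * heads * head_dim)


def attn_to_linear_ratio(n_tokens: int, params: float, layers: int,
                         heads: int, head_dim: int) -> float:
    """(4*L*h*d*N^2) / (2*P*N), which simplifies to N / N_crossover."""
    return n_tokens / crossover_tokens(params, layers, heads, head_dim)


n_x = crossover_tokens(8e9, 32, 32, 128)  # Llama-3-8B sketch values
print(f"crossover ~{n_x:,.0f} tokens")
for n in (2048, 16384, 131072):
    ratio = attn_to_linear_ratio(n, 8e9, 32, 32, 128)
    print(f"N={n:6d}: attention is {ratio:.2f}x the linear term")
```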

The first decode forward

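After prefill, this model charges one more forward pass with N=1 to sample the first output token. (Many engines sample that token directly from the prefill logits, in which case the term drops out and the estimate is slightly conservative.) This pass is memory-bound: AI ≈ 1, far below the ridge point. The cost is reading the model weights from HBM one more time.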

T_decode₁ ≈ P · sizeof(dtype) / (BW · MBU)

(first decode, memory-bound)

For Llama-3-8B on H100 with MBU ≈ 0.7: 16 GB / (3350 · 0.7) ≈ 6.8 ms. The KV-cache also has to be read, but for a 2k context with GQA it's only ~270 MB — under a millisecond.
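Here's a small sketch of both reads: the weights and the KV-cache. It assumes the weights occupy P · sizeof(dtype) bytes (16 GB for 8B parameters in BF16) and that K and V are each stored per layer, per KV head, per token. Helper names are mine.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, n_tokens, dtype_bytes=2):
    """K and V tensors for every layer, KV head and cached token."""
    return 2 * layers * kv_heads * head_dim * n_tokens * dtype_bytes


def memory_bound_ms(n_bytes, hbm_gbs, mbu):
    """Time to stream n_bytes from HBM at the achieved bandwidth."""
    return 1e3 * n_bytes / (hbm_gbs * 1e9 * mbu)


weights = 8e9 * 2                       # Llama-3-8B in BF16
kv = kv_cache_bytes(32, 8, 128, 2048)   # 8 KV heads (GQA), 2k context

print(f"weights:  {memory_bound_ms(weights, 3350, 0.7):.1f} ms")
print(f"KV-cache: {kv / 1e6:.0f} MB, {memory_bound_ms(kv, 3350, 0.7):.2f} ms")
```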

note

This is also why decode (token-by-token) is always memory-bound at batch size 1. Speculative decoding beats this by verifying multiple tokens in one forward — see the companion post.

Levers that actually move TTFT

Things that help

Prefix caching. If a prompt shares a prefix with a recent one (system prompt, few-shot examples), reuse its KV-cache. This skips the most expensive part of prefill entirely. Easy 3–10× win on system-prompt-heavy workloads.
FP8 / INT8 weights. Halves the bytes you load and roughly doubles your compute throughput on H100 / MI300X. Direct ~1.6–1.9× prefill speedup with FP8.
FlashAttention 2/3. Eliminates the materialisation of the N×N score matrix in HBM. Marginal at short context, huge past 16k.
Paged attention + continuous batching. Doesn't speed up your prefill, but cuts T_queue by letting more requests share the GPU efficiently. P99 TTFT lives or dies here.
Tensor / pipeline parallelism. More FLOPS at the cost of inter-GPU comms. Below ~1k tokens, comms overhead can increase TTFT.
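To get a feel for the prefix-caching lever in the list above, here's a back-of-envelope sketch. It assumes a cached prefix of C tokens lets prefill skip the weight matmuls for those tokens entirely, while the remaining N − C tokens still attend over all N positions. That cost model, the function name, and the 1536-token shared prefix are my assumptions, not measurements.

```python
def prefill_ms(P, L, h, d, N, peak_tflops, mfu=0.45, cached=0):
    """Compute-bound prefill estimate with `cached` prefix tokens reused."""
    new = N - cached                  # tokens that still need a forward pass
    linear = 2 * P * new              # weight matmuls only for new tokens
    attn = 4 * L * h * d * new * N    # new queries score against all N keys
    return 1e3 * (linear + attn) / (peak_tflops * 1e12 * mfu)


# Llama-3-8B sketch values on an H100, 2k prompt, 1536 tokens of shared prefix.
cold = prefill_ms(8e9, 32, 32, 128, 2048, 989)
warm = prefill_ms(8e9, 32, 32, 128, 2048, 989, cached=1536)
print(f"cold {cold:.1f} ms, warm {warm:.1f} ms, speedup {cold / warm:.1f}x")
```

Once the uncached suffix drops below the ridge point it is bandwidth-bound, so the real gain likely flattens for very short suffixes.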

Things that don't help TTFT (even if they help other things)

Speculative decoding. Helps ITL (inter-token latency) — i.e. tokens after the first — but TTFT is the same. You still pay one full prefill.
Bigger batches. Improves throughput, often hurts P99 TTFT under load.
Sampling tricks (top-k, top-p). The cost of sampling is negligible; you can't optimise here.

Estimation checklist

estimating TTFT in one minute

Compute 2·P·N linear FLOPs.
If N > ~16k, add the attention term.
Divide by peak_TFLOPS · 0.45 for prefill time.
Add P·sizeof(dtype) / (HBM_BW · 0.7) for first decode.
Add 10–30 ms for queue + scheduler + streaming flush.

If your measured TTFT is 2–3× higher than what this gives you, the problem is almost never the math — it's queueing, tokenizer overhead you forgot about, or an MFU on the floor because your kernels aren't fused. Start there.

ttft_estimator.py (Python)

def estimate_ttft(
    P: float,          # parameter count
    N: int,            # prompt tokens
    L: int, h: int, d: int,  # layers, query heads, head dim
    dtype_bytes: float,      # 2 for BF16, 1 for FP8
    peak_tflops: float,      # GPU peak in TFLOPS (dense BF16 or FP8 equivalent)
    hbm_bw_gbs: float,       # HBM bandwidth in GB/s (H100: 3350)
    mfu: float = 0.45,       # model FLOPs utilisation
    mbu: float = 0.70,       # memory bandwidth utilisation
    queue_ms: float = 15.0,  # scheduler + tokenize + flush
) -> float:
    """Return a back-of-envelope TTFT estimate in milliseconds."""
    # Prefill: assumed compute-bound, so FLOPs / achieved FLOPS.
    linear_flops = 2 * P * N
    attn_flops   = 4 * L * h * d * N * N
    t_prefill_ms = 1e3 * (linear_flops + attn_flops) / (peak_tflops * 1e12 * mfu)

    # First decode: memory-bound, one full read of the weights from HBM.
    weight_bytes  = P * dtype_bytes
    t_decode1_ms  = 1e3 * weight_bytes / (hbm_bw_gbs * 1e9 * mbu)

    return queue_ms + t_prefill_ms + t_decode1_ms

Run it for your model + GPU before you spin up a load test. If the numbers say 80 ms but you're getting 600 ms, the bottleneck isn't where you think it is.
