How tokens leave the GPU faster.

A visual walk-through of TTFT and speculative decoding — the two ideas behind almost every recent jump in LLM serving latency. Play with the variables, watch the speedup peak, and see when it collapses.

// 01 · time to first token

TTFT is what users actually feel.

Throughput is for the SRE dashboard. TTFT is what you feel when you hit enter. It's the wall-clock time from request → first token streamed back, dominated by prefill.

Most low-latency tricks either cut the prefill work itself (prompt caching, prefix reuse on top of a paged KV-cache) or change how prefill is scheduled so it blocks less (chunked prefill, continuous batching).

// ttft = tokenize + prefill + decode₁
total 84 ms
  • tokenize · 3 ms · split prompt into tokens
  • prefill · 68 ms · compute KV-cache over the prompt
  • decode₁ · 13 ms · first forward pass, one token
Prefill is the heavy one: it's a single forward pass over the entire prompt. Decode₁ is fast (one token), but it can only start once the KV-cache is warm. Lower TTFT → faster perceived response.
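
To make the definition concrete, here is a minimal sketch of measuring TTFT on the client side. It assumes only that the response arrives as an iterator of tokens; `fake_stream` is a hypothetical stand-in for a real streaming client, and its delays simply reuse the 84 ms and 13 ms figures from the breakdown above.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def fake_stream(first_token_s: float, decode_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM client (not a real API)."""
    time.sleep(first_token_s)      # tokenize + prefill + first decode step
    for i in range(n_tokens):
        if i > 0:
            time.sleep(decode_s)   # every later token costs one decode step
        yield f"tok{i}"


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, List[str]]:
    """Return (ttft_seconds, total_seconds, tokens) for any token iterator."""
    start = time.perf_counter()    # the generator body has not started running yet
    ttft = None
    tokens: List[str] = []
    for tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - start   # first token has arrived
        tokens.append(tok)
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, tokens


ttft, total, tokens = measure_ttft(fake_stream(0.084, 0.013, 5))
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, {len(tokens)} tokens")
```

Starting the clock before the first `next()` on the stream matters: anything that happens before the first token, including queueing, counts toward TTFT.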

// 02 · speculative decoding

Draft cheap. Verify in parallel. Commit the prefix.

A small draft model proposes k tokens. The big target model verifies all k in one forward pass. Whatever prefix survives gets committed — sometimes you ship 5 tokens for the price of 1.

// speculative decoding · one round
draft model · M_q (small · ~10× faster) proposes 5 tokens:
  t1 The · t2 quick · t3 brown · t4 fox · t5 jumps
target model · M_p (large · one parallel forward pass) runs forward(prompt + drafts) and predicts:
  The · quick · brown · fox · leaps
verify · accept-reject: rejection sampling along the prefix. In this example the target agrees with t1–t4 and prefers "leaps" over "jumps" at t5.

01

Draft k tokens

Cheap autoregressive run on a small model. ~5–10× faster than the target.

02

One target forward

Run M_p over prompt + drafts in parallel. Get a probability distribution at every position.

03

Accept the prefix

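Walk left-to-right. Accept each drafted token x with probability min(1, P_target(x) / P_draft(x)), i.e. while P_target / P_draft ≥ rand(). On the first reject, resample that position from the residual distribution, max(0, P_target − P_draft) renormalized, and discard the remaining drafts.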
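
The accept-or-resample rule is what keeps the output distributed exactly as if the target had sampled on its own. The sketch below checks that for a single position using exact arithmetic rather than sampling. The two distributions are made-up numbers, and it assumes the draft gives non-zero probability to every token.

```python
import numpy as np


def residual(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Distribution to resample from after a rejection: norm(max(0, p - q))."""
    r = np.maximum(p - q, 0.0)
    if r.sum() == 0.0:             # p == q: a rejection can never happen
        return p
    return r / r.sum()


def output_distribution(p: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Exact distribution of the token emitted at one position."""
    accept = np.minimum(1.0, p / q)    # acceptance probability per drafted token
    p_accept = q * accept              # token x was drafted and accepted
    p_reject = 1.0 - p_accept.sum()    # total probability of a rejection
    return p_accept + p_reject * residual(p, q)


p = np.array([0.5, 0.3, 0.2])   # target distribution (illustrative)
q = np.array([0.2, 0.6, 0.2])   # draft distribution (illustrative)

print("accept prob:", np.minimum(1.0, p / q))
print("residual:   ", residual(p, q))
print("output:     ", output_distribution(p, q))
```

The printed output distribution matches `p`, even though every proposal came from `q`.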

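spec_decode.py · python
# Sketch: draft_model, target_model, sample, sample_residual, EOS and MAX_LEN
# are placeholders, not a real library API.
from random import random
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

done = False
while not done:
    # drafts: K token ids; p_draft[i]: full draft distribution at draft position i
    drafts, p_draft = draft_model.sample(prompt, k=K)
    # forward() is assumed to return one logits row per input position, where
    # row j predicts token j+1; keep the K+1 rows that score the drafts + bonus
    logits_target = target_model.forward(prompt + drafts)[len(prompt) - 1:]

    accepted = 0
    for i, tok in enumerate(drafts):
        p_t = softmax(logits_target[i])[tok]
        if random() < min(1.0, p_t / p_draft[i][tok]):
            accepted += 1
        else:
            break  # first rejection ends the round

    prompt.extend(drafts[:accepted])
    if accepted < K:
        # rejected at this position: resample from norm(max(0, p_target - p_draft))
        prompt.append(sample_residual(logits_target[accepted], p_draft[accepted]))
    else:
        prompt.append(sample(logits_target[K]))  # all K accepted: bonus token
    done = prompt[-1] == EOS or len(prompt) >= MAX_LEN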
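
For a version of the loop that runs end to end, the sketch below swaps the real models for toy stand-ins. Both "models" are assumed to be simple lookup tables that give a next-token distribution from the last token only: `P` plays the target and `Q` a noisy copy acting as the draft. The table sizes, the noise level, and the name `spec_round` are all illustrative choices, not part of any library.

```python
from typing import List, Tuple

import numpy as np

rng = np.random.default_rng(0)
V, K = 8, 4                                # toy vocabulary size and draft length


def random_rows(v: int) -> np.ndarray:
    """Random table whose rows are probability distributions."""
    m = rng.random((v, v)) + 0.05
    return m / m.sum(axis=1, keepdims=True)


P = random_rows(V)                         # toy target: next-token dist given last token
Q = 0.8 * P + 0.2 * random_rows(V)         # toy draft: a noisy copy of the target


def spec_round(seq: List[int]) -> Tuple[List[int], int]:
    """One draft/verify/commit round: returns (new tokens, drafts accepted)."""
    # 1. Draft K tokens autoregressively with the cheap table.
    drafts, last = [], seq[-1]
    for _ in range(K):
        last = int(rng.choice(V, p=Q[last]))
        drafts.append(last)
    # 2. "One forward pass": look up the target at all K+1 positions at once.
    ctx = [seq[-1]] + drafts               # the token each position conditions on
    p_rows = P[ctx]                        # shape (K+1, V)
    q_rows = Q[ctx[:K]]                    # shape (K, V)
    # 3. Accept left to right; the first rejection ends the round.
    out: List[int] = []
    for i, tok in enumerate(drafts):
        if rng.random() < min(1.0, p_rows[i, tok] / q_rows[i, tok]):
            out.append(tok)
            continue
        res = np.maximum(p_rows[i] - q_rows[i], 0.0)   # residual distribution
        out.append(int(rng.choice(V, p=res / res.sum())))
        return out, i
    out.append(int(rng.choice(V, p=p_rows[K])))        # all accepted: bonus token
    return out, K


seq, accepted_total, rounds = [0], 0, 0
while len(seq) < 200:
    new, n_acc = spec_round(seq)
    seq.extend(new)
    accepted_total += n_acc
    rounds += 1
print(f"{rounds} rounds, {len(seq) - 1} tokens, "
      f"accepted/drafted = {accepted_total / (rounds * K):.2f}")
```

Every round commits at least one token (the resampled one) and at most K + 1, which is why the number of rounds is well below the number of tokens when the draft tracks the target closely.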
// 03 · the math, on a slider

Why bigger k isn't always better.

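Speculative decoding has a cliff. Push k too high with a low acceptance rate α and you spend draft time on tokens you'll throw away. The optimal k depends on α (the per-token acceptance probability), c (the cost of one draft step relative to one target forward pass), and your workload.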

  • α high (≥ 0.85) → push k up to 5–7
  • α mid (0.6–0.8) → k = 3–5 is the sweet spot
  • α low (< 0.5) → don't speculate; pick a better draft

// playground · speedup model
E[tokens/step] = (1 − α^(k+1)) ÷ (1 − α) = 2.94
speedup ≈ E[tokens/step] ÷ (1 + c·k) = 1.68×

speedup vs draft length (peak at k=3 · 1.75×)
k        1      2      3      4      5      6      7      8
speedup  1.48×  1.68×  1.75×  1.73×  1.68×  1.61×  1.53×  1.45×

speedup > 1 means net win. As α drops, longer drafts waste compute and the curve collapses.
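
The speedup model is small enough to evaluate directly. The sketch below implements the two formulas as written and sweeps k. The settings α = 0.7 and c = 0.15 are not stated in the text; they are inferred here because they reproduce the displayed numbers (2.94 expected tokens at k = 5, and the curve peaking at k = 3). The function names are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """E[tokens/step] = (1 - alpha^(k+1)) / (1 - alpha), for 0 <= alpha < 1."""
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, c: float, k: int) -> float:
    """Expected tokens per step divided by the step cost 1 + c*k."""
    return expected_tokens(alpha, k) / (1.0 + c * k)


def best_k(alpha: float, c: float, k_max: int = 16) -> int:
    """Draft length in 1..k_max with the highest modelled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, c, k))


alpha, c = 0.7, 0.15   # inferred from the displayed numbers, not given in the text
for k in range(1, 9):
    print(f"k={k}  E={expected_tokens(alpha, k):.2f}  speedup={speedup(alpha, c, k):.2f}x")
print("best k:", best_k(alpha, c))
```

Calling `best_k` with a few other values of α and c is a quick way to see how far the sweet spot moves before committing to a draft length.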
// 04 · variants

All the ways to draft.

Every flavour you'll see in production is a different answer to "where does the draft come from?"

01 / 04
leviathan · 2023

Vanilla speculative

Pair a small draft model M_q with the target M_p. Verify k draft tokens in one M_p forward pass.

02 / 04
medusa · 2024

Medusa

Add several decoding heads to the target model itself. Heads propose, the base verifies — no separate draft model.

03 / 04
eagle · 2024

EAGLE

Train a tiny network on top of the target's hidden states. Re-uses the KV-cache and improves acceptance.

04 / 04
lookahead · 2024

Lookahead decoding

Use the target itself to generate n-gram drafts via Jacobi iteration. No extra model, but spikier acceptance.

the take-away

TTFT comes from prefill. Throughput comes from parallel decode.

Speculative decoding is the cleanest answer when α is high enough and the draft is cheap enough. Everything else — Medusa, EAGLE, Lookahead — is moving where the draft comes from.