
How tokens leave the GPU faster.

A walk-through of TTFT and speculative decoding: the latency users feel first, and one of the main techniques behind recent gains in LLM decoding speed. See where the speedup peaks and when it collapses.


// 01 · time to first token

TTFT is what users actually feel.

Throughput is for the SRE dashboard. TTFT is what you feel when you hit enter. It's the wall-clock time from request → first token streamed back, dominated by prefill.

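Most low-latency tricks either shrink the prefill work on the critical path (prompt caching reuses the KV-cache of a shared prefix) or cut the time a request waits before its prefill runs (continuous batching, chunked prefill, and the larger batches a paged KV-cache allows).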
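TTFT as defined above can be measured by timing the gap between issuing the request and receiving the first streamed token. A minimal sketch, assuming the client exposes the response as a Python iterator of tokens; `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` only stands in for a real streaming client.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft(start_stream: Callable[[], Iterable[str]]) -> tuple[float, str]:
    """Return (seconds until the first token arrives, that token)."""
    start = time.perf_counter()
    stream = iter(start_stream())      # the request is issued here
    first_token = next(stream)         # blocks until the first token is streamed
    return time.perf_counter() - start, first_token


def fake_stream(prefill_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client: sleep, then yield tokens."""
    time.sleep(prefill_s)              # pretend this is tokenize + prefill + decode_1
    yield from ["The", " quick", " brown", " fox"]


ttft_s, token = measure_ttft(fake_stream)
print(f"TTFT: {ttft_s * 1000:.0f} ms, first token: {token!r}")
```

Measured from the client, the number also includes network and queueing time, which the three-stage breakdown does not show.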

// ttft = tokenize + prefill + decode₁

total 84 ms

tokenize · 3 ms · split prompt into tokens
prefill · 68 ms · compute KV-cache over the prompt
decode₁ · 13 ms · first forward pass, one token

Prefill is the heavy one: it's a single forward pass over the entire prompt, so its cost grows with prompt length. Decode₁ is fast (one token), but it can only start once the KV-cache is warm. Lower TTFT → faster perceived response.
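Adding up the stages from the breakdown above shows how lopsided it is. The snippet only re-adds the three example timings from the figure; they are illustrative values, not measurements.

```python
# Stage timings in milliseconds, copied from the breakdown above.
stages_ms = {"tokenize": 3, "prefill": 68, "decode_1": 13}

ttft_ms = sum(stages_ms.values())
shares = {name: ms / ttft_ms for name, ms in stages_ms.items()}

print(f"ttft = {ttft_ms} ms")
for name, share in shares.items():
    print(f"{name:>8}: {share:.0%}")
```

In this example prefill accounts for roughly 81% of the 84 ms, which is why most TTFT work targets it.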

// 02 · speculative decoding

Draft cheap. Verify in parallel. Commit the prefix.

A small draft model proposes k tokens. The big target model verifies all k in one forward pass. Whatever prefix survives gets committed — sometimes you ship 5 tokens for the price of 1.

// speculative decoding · one round, k = 5

draft model · M_q (small · ~10× faster) proposes: The quick brown fox jumps

target model · M_p (large · one parallel forward pass) runs forward(prompt + drafts) and predicts: The quick brown fox leaps

verify · accept-reject: rejection sampling along the prefix decides how many drafted tokens are committed this round.

Draft k tokens

Cheap autoregressive run on a small model. ~5–10× faster than the target.

One target forward

Run M_p over prompt + drafts in parallel. Get a probability distribution at every position.

Accept the prefix

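Walk left-to-right. Accept each draft token with probability min(1, P_target / P_draft). At the first rejection, sample a replacement from the residual (the normalised positive part of P_target − P_draft) and discard the remaining drafts. If all k survive, sample one bonus token from the target.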

spec_decode.py · python

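import random

# Assumed interfaces: draft_model.sample returns K tokens plus the draft's full
# distribution q[i] at each drafted position; target_model.forward returns one
# row of logits per input position. softmax, sample, sample_residual and EOS
# are helpers defined elsewhere.
done = False
while not done:
    drafts, q = draft_model.sample(prompt, k=K)
    # Keep the last K + 1 rows: logits[i] scores drafts[i], logits[K] the token after.
    logits = target_model.forward(prompt + drafts)[len(prompt) - 1:]

    accepted = 0
    for i, tok in enumerate(drafts):
        p_t = softmax(logits[i])[tok]
        if random.random() < min(1.0, p_t / q[i][tok]):
            accepted += 1
        else:
            break  # first rejection ends the round

    prompt.extend(drafts[:accepted])
    if accepted < K:
        # Replace the rejected token with a sample from norm(max(0, p - q)).
        prompt.append(sample_residual(softmax(logits[accepted]), q[accepted]))
    else:
        prompt.append(sample(softmax(logits[K])))  # bonus token
    done = prompt[-1] == EOS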
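The listing above leaves `sample_residual` undefined. Below is a minimal sketch of one position of the accept-reject step, assuming `p` and `q` are the target and draft distributions given as plain lists of probabilities over a toy three-token vocabulary. The names `verify_one` and `sample_residual`, the explicit `rng` argument, and the example distributions are all illustrative. The check at the end draws tokens from `q`, runs them through the rule, and confirms the committed tokens follow `p`.

```python
import random


def sample_residual(p: list[float], q: list[float], rng: random.Random) -> int:
    """Sample from norm(max(0, p - q)), the distribution used after a rejection."""
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    # A rejection is only possible when p != q, so the residual has positive mass.
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def verify_one(p: list[float], q: list[float], tok: int, rng: random.Random) -> tuple[int, bool]:
    """Accept draft token `tok` with probability min(1, p[tok] / q[tok])."""
    if rng.random() < min(1.0, p[tok] / q[tok]):
        return tok, True
    return sample_residual(p, q, rng), False


# Toy check: tokens that come out of verify_one should follow p, not q.
rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # target distribution (illustrative)
q = [0.2, 0.5, 0.3]   # draft distribution (illustrative)
n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    draft_tok = rng.choices(range(3), weights=q, k=1)[0]
    out, _ = verify_one(p, q, draft_tok, rng)
    counts[out] += 1
empirical = [c / n for c in counts]
print(empirical)  # close to p
```

This is the property that makes speculative decoding lossless under sampling: the draft only changes how fast tokens arrive, not which distribution they come from.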

// 03 · the math, on a slider

Why bigger k isn't always better.

Speculative decoding has a cliff. Push k too high with a low acceptance rate α and you spend draft time on tokens you'll throw away. The optimal k depends on α, on the draft cost c (draft-forward time ÷ target-forward time), and on your workload.

α high (≥ 0.85) → push k up to 5–7
α mid (0.6–0.8) → k = 3–5 is the sweet spot
α low (< 0.5) → don't speculate; pick a better draft
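For a single position, the acceptance probability under the accept-reject rule works out to the overlap between the two distributions, Σ min(p, q). A minimal sketch, assuming both distributions are available as plain lists; the example values are illustrative, and a usable α would be averaged over many positions of a representative workload.

```python
def acceptance_probability(p: list[float], q: list[float]) -> float:
    """Probability that a token drafted from q is accepted against target p."""
    # sum_x q(x) * min(1, p(x) / q(x)) simplifies to sum_x min(p(x), q(x)).
    return sum(min(pi, qi) for pi, qi in zip(p, q))


p = [0.6, 0.3, 0.1]          # target distribution (illustrative)
close_q = [0.5, 0.35, 0.15]  # draft that tracks the target well
far_q = [0.2, 0.5, 0.3]      # draft that often disagrees

print(acceptance_probability(p, close_q))
print(acceptance_probability(p, far_q))
```

The first draft lands in the "α high" band and the second at the bottom of the "α mid" band, even though both put most of their mass on the same three tokens.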

// playground · speedup model

Default settings for the speedup model:

acceptance rate · α = 0.70 · fraction of draft tokens the target accepts
draft length · k = 5 · tokens proposed by the draft each round
draft cost · c = 0.15 · draft-forward time ÷ target-forward time

E[tokens/step] = (1 − α^(k+1)) ÷ (1 − α) = 2.94

speedup ≈ E[tokens/step] ÷ (1 + c·k) = 1.68×
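The closed form for E[tokens/step] can be sanity-checked with a small simulation. This sketch assumes, as the formula does, that each draft token is accepted independently with probability α. Real acceptance is correlated across positions, so treat it as a check on the algebra rather than a prediction. The function names are illustrative.

```python
import random


def expected_tokens(alpha: float, k: int) -> float:
    """Closed form: (1 - alpha^(k+1)) / (1 - alpha), for alpha < 1."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def simulate_tokens(alpha: float, k: int, rounds: int, rng: random.Random) -> float:
    """Average tokens committed per round under i.i.d. acceptance."""
    total = 0
    for _ in range(rounds):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        total += accepted + 1   # +1: the resampled token on reject, or the bonus token
    return total / rounds


rng = random.Random(0)
closed = expected_tokens(0.70, 5)
simulated = simulate_tokens(0.70, 5, 200_000, rng)
print(f"closed form: {closed:.2f}")
print(f"simulated:   {simulated:.2f}")
```

The `+ 1` is why the sum runs to α^k rather than α^(k−1): every round commits at least one token from the target, even when the first draft is rejected.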

speedup vs draft length · peak at k=3 · 1.75×

k        1      2      3      4      5      6      7      8
speedup  1.48×  1.68×  1.75×  1.73×  1.68×  1.61×  1.53×  1.45×

Speedup > 1 means a net win. As α drops, longer drafts waste compute and the curve collapses.
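The curve above can be reproduced from the two formulas, and the same code shows the collapse at low α. A minimal sketch, assuming the same simplified cost model (one target pass costs 1, each draft token costs c); `speedup` and `best_k` are illustrative names, and the search is capped at k = 8 to match the chart.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per round divided by the round's cost in target passes."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (1 + c * k)


def best_k(alpha: float, c: float, max_k: int = 8) -> int:
    """Draft length in 1..max_k with the highest modelled speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, c))


c = 0.15
for alpha in (0.85, 0.70, 0.30):
    k = best_k(alpha, c)
    print(f"alpha={alpha:.2f}  best k={k}  speedup={speedup(alpha, k, c):.2f}x  "
          f"at k=8: {speedup(alpha, 8, c):.2f}x")
```

With α = 0.30 the model puts k = 8 at about 0.65×, which is slower than not speculating at all.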

// 04 · variants

All the ways to draft.

Every flavour you'll see in production is a different answer to "where does the draft come from?"

01 / 04 · Vanilla speculative · leviathan · 2023

Pair a small draft model M_q with the target M_p. Verify k draft tokens in one M_p forward pass.

02 / 04 · Medusa · medusa · 2024

Add several decoding heads to the target model itself. Heads propose, the base verifies, with no separate draft model.

03 / 04 · EAGLE · eagle · 2024

Train a tiny network on top of the target's hidden states. Re-uses the KV-cache and improves acceptance.

04 / 04 · Lookahead decoding · lookahead · 2024

Use the target itself to generate n-gram drafts via Jacobi iteration. No extra model, but spikier acceptance.
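Since the variants differ mainly in where drafts come from, they can share one verify loop behind a common interface. The sketch below illustrates that idea only: `Drafter` and `NGramDrafter` are hypothetical names, and the n-gram lookup is a deliberately simple stand-in, not an implementation of Lookahead decoding, Medusa, or EAGLE.

```python
from typing import Protocol


class Drafter(Protocol):
    def propose(self, tokens: list[str], k: int) -> list[str]:
        """Return up to k draft tokens that continue `tokens`."""
        ...


class NGramDrafter:
    """Toy drafter: copy what followed the last n tokens earlier in the context."""

    def __init__(self, n: int = 2) -> None:
        self.n = n

    def propose(self, tokens: list[str], k: int) -> list[str]:
        if len(tokens) <= self.n:
            return []
        suffix = tokens[-self.n:]
        # Scan right-to-left for the most recent earlier occurrence of the suffix.
        for start in range(len(tokens) - self.n - 1, -1, -1):
            if tokens[start:start + self.n] == suffix:
                return tokens[start + self.n:start + self.n + k]
        return []  # no match: nothing to speculate on this round


drafter: Drafter = NGramDrafter(n=2)
context = "the quick brown fox jumps over the quick".split()
print(drafter.propose(context, k=3))
```

Whatever sits behind `propose`, the target's verify step stays the same, so a weak drafter costs speed but not output quality.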

the take-away

TTFT comes from prefill. Throughput comes from parallel decode.

Speculative decoding is the cleanest answer when α is high enough and the draft is cheap enough. Everything else — Medusa, EAGLE, Lookahead — is moving where the draft comes from.
