A visual walk-through of TTFT and speculative decoding — the two ideas behind almost every recent jump in LLM serving latency. Play with the variables, watch the speedup peak, and see when it collapses.
Throughput is for the SRE dashboard. TTFT (time to first token) is what you feel when you hit enter: the wall-clock time from sending the request to the first token streamed back, dominated by prefill.
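Most low-latency tricks either shrink the prefill work (prompt caching, which reuses KV entries for a shared prefix) or schedule it better (chunked prefill and continuous batching, which interleave prefill with the decode steps of other requests). A paged KV-cache helps indirectly by keeping that reuse and batching memory-efficient.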
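To make the definition concrete, here is a minimal sketch of measuring TTFT on the client side. It assumes the server exposes tokens as a Python iterable; `measure_ttft` and `fake_stream` are hypothetical names for illustration, and `fake_stream` only simulates prefill and decode with sleeps.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_ttft(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, total_seconds, n_tokens) for a token stream."""
    start = clock()
    ttft = None
    n_tokens = 0
    for _ in stream:
        if ttft is None:
            # First token arrived: this gap is the TTFT.
            ttft = clock() - start
        n_tokens += 1
    total = clock() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, n_tokens


def fake_stream(prefill_s: float, decode_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model server (not a real API)."""
    time.sleep(prefill_s)  # simulated prefill before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(decode_s)  # simulated per-token decode step
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT {ttft * 1000:.1f} ms, total {total * 1000:.1f} ms, {n} tokens")
```

Because the generator is lazy, the simulated prefill runs inside the timed loop, so the first measured gap includes it, which is the point of the metric.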
A small draft model proposes k tokens. The big target model verifies all k in one forward pass. Whatever prefix survives gets committed — sometimes you ship 5 tokens for the price of 1.
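Draft: run the small model autoregressively for k steps. It is cheap, roughly 5–10× faster per token than the target.
Verify: run M_p once over prompt + drafts, all positions in parallel, to get the target's probability distribution at every draft position.
Accept: walk left to right, accepting each draft token with probability min(1, P_target / P_draft), i.e. while P_target / P_draft ≥ rand(). On the first rejection, sample a replacement from the residual, the normalized max(0, P_target − P_draft), and discard the later drafts.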
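The accept step for a single position can be sketched in plain Python. This is an illustration under stated assumptions, not a production implementation: distributions are plain lists of probabilities, and `residual` and `verify_token` are hypothetical helper names. The residual is taken to be the normalized max(0, p − q), which is what makes the committed token follow the target distribution exactly.

```python
import random
from typing import List, Sequence, Tuple


def residual(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q): mass the target wants beyond what the draft gave."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p == q everywhere; a rejection cannot happen, so fall back to p.
        return list(p)
    return [r / total for r in raw]


def verify_token(
    tok: int, p: Sequence[float], q: Sequence[float], rng: random.Random
) -> Tuple[int, bool]:
    """Check one draft token `tok` (assumed sampled from q) against target p.

    Returns (committed_token, accepted).
    """
    if rng.random() < min(1.0, p[tok] / q[tok]):
        return tok, True
    # Rejected: resample this position from the residual distribution.
    replacement = rng.choices(range(len(p)), weights=residual(p, q))[0]
    return replacement, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.6, 0.3, 0.1]  # target distribution (made-up numbers)
    q = [0.2, 0.5, 0.3]  # draft distribution (made-up numbers)
    counts = [0, 0, 0]
    for _ in range(20000):
        draft_tok = rng.choices(range(3), weights=q)[0]
        out, _ = verify_token(draft_tok, p, q, rng)
        counts[out] += 1
    print([c / 20000 for c in counts])  # empirical frequencies, close to p
```

Drafting from q and then verifying against p reproduces p, so a poor draft model costs speed, not output quality.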
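# Pseudocode: draft_model / target_model are placeholders, not a real API.
while not done:
    # Draft K tokens; q[i] is the draft's full distribution at step i.
    drafts, q = draft_model.sample(prompt, k=K)
    # One target pass. Keep the last K+1 distributions: those that predict
    # drafts[0..K-1], plus the one for the token after the last draft.
    p = softmax(target_model.forward(prompt + drafts)[-(K + 1):])
    accepted = 0
    for i, tok in enumerate(drafts):
        if random() < min(1.0, p[i][tok] / q[i][tok]):
            accepted += 1
        else:
            break  # reject: this position is resampled from the residual
    prompt.extend(drafts[:accepted])
    if accepted < K:
        # residual = normalize(max(0, p - q)) at the rejected position
        prompt.append(sample_residual(p[accepted], q[accepted]))
    else:
        prompt.append(sample(p[K]))  # bonus token: all K drafts accepted

Speculative decoding has a cliff. Push k too high with a low acceptance rate α and you spend draft time on tokens you'll throw away. The optimal k depends on α, on c (the cost of one draft step relative to one target forward pass), and on your workload.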
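The cliff can be made concrete with a small cost model. The sketch below assumes each draft token is accepted independently with probability α, that c is the cost of one draft step relative to one target forward pass, and that verifying k tokens costs the same as one ordinary target step. Under those assumptions the expected number of tokens committed per target pass is (1 − α^(k+1)) / (1 − α), and each iteration costs k·c + 1 target-pass units. The function names are hypothetical, and real systems add batching and memory effects that this ignores.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass (i.i.d. acceptance rate alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Estimated speedup over plain decoding; c = draft step cost / target pass cost."""
    return expected_tokens(alpha, k) / (k * c + 1.0)


def best_k(alpha: float, c: float, k_max: int = 32) -> int:
    """Draft length in [0, k_max] that maximizes the estimated speedup (0 = no drafting)."""
    return max(range(k_max + 1), key=lambda k: speedup(alpha, k, c))


if __name__ == "__main__":
    for alpha in (0.3, 0.6, 0.9):
        k = best_k(alpha, c=0.05)
        print(f"alpha={alpha}: best k={k}, speedup={speedup(alpha, k, 0.05):.2f}x")
```

In this model, α = 0.6 with c = 0.05 peaks at k = 4, while a weak and expensive draft (α = 0.3, c = 0.5) never beats plain decoding, so the best choice is k = 0.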
Every flavour you'll see in production is a different answer to "where does the draft come from?"
Vanilla speculative: pair a small draft model M_q with the target M_p. Verify k draft tokens in one M_p forward pass.
Medusa: add several decoding heads to the target model itself. Heads propose, the base verifies; no separate draft model.
EAGLE: train a tiny network on top of the target's hidden states. Re-uses the KV-cache and improves acceptance.
Lookahead: use the target itself to generate n-gram drafts via Jacobi iteration. No extra model, but spikier acceptance.
Vanilla speculative decoding is the cleanest answer when α is high enough and the draft is cheap enough. Everything else (Medusa, EAGLE, Lookahead) changes where the draft comes from.