Vanilla autoregressive decoding generates one token per forward pass. That makes the inter-token latency (ITL) of a 70B model a straightforward function of HBM bandwidth: every step you load ~140 GB of weights to produce a single token. Speculative decoding breaks that 1:1 by spending cheap compute to verify many tokens in one expensive pass.
This post is a companion to the visual walk-through on /learn, with the math behind the bar chart.
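To make the bandwidth bound from the opening paragraph concrete, here is a back-of-the-envelope sketch. It assumes decode is purely weight-bandwidth-bound (no KV-cache traffic, no compute or communication overhead). The 4,000 GB/s aggregate bandwidth is a hypothetical figure, not a measurement of any particular system.

```python
def itl_lower_bound_ms(weight_gb: float, hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on inter-token latency if each step streams the weights exactly once."""
    return weight_gb / hbm_bandwidth_gb_s * 1000.0


# ~140 GB of weights (70B parameters at 2 bytes each), hypothetical 4,000 GB/s of HBM bandwidth.
vanilla_ms = itl_lower_bound_ms(140.0, 4000.0)
print(f"vanilla: {vanilla_ms:.1f} ms per token")

# If one target pass commits 3 tokens on average (draft cost ignored here),
# the same weight load is amortised over 3 tokens.
amortised_ms = vanilla_ms / 3
print(f"3 tokens per pass: {amortised_ms:.1f} ms per token")
```

The point is the shape of the arithmetic, not the specific numbers: the weight load is a fixed cost per target pass, so anything that commits more tokens per pass divides it.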
Why it works
A transformer forward pass over a sequence of length n+k already gives you the next-token distribution at every position — including positions n through n+k-1. Vanilla decoding throws those away. Speculative decoding uses them as cheap verifications for a draft sequence.
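The off-by-one here is easy to get wrong, so a small sketch of the bookkeeping may help. It assumes the forward pass returns one row of logits per input token, 0-indexed, where row j is the next-token distribution after input token j. The function name and layout are illustrative, not taken from any library.

```python
def verification_rows(prompt_len: int, num_drafts: int) -> tuple[list[int], int]:
    """Return (rows that score each draft token, row that yields the bonus token).

    Assumes 0-indexed rows where row j is the next-token distribution after input token j.
    """
    first = prompt_len - 1                                # last prompt token predicts draft 0
    draft_rows = [first + i for i in range(num_drafts)]   # row first + i scores draft i
    bonus_row = first + num_drafts                        # last draft token predicts the bonus token
    return draft_rows, bonus_row


rows, bonus = verification_rows(prompt_len=5, num_drafts=3)
print(rows, bonus)  # [4, 5, 6] 7
```

With a 5-token prompt and 3 drafts, the input has 8 tokens and only the last 4 rows matter for verification.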
The brilliant part — from Leviathan, Kalman & Matias, 2023 — is that you can do this without changing the output distribution. The accept/reject step uses rejection sampling so the final stream is statistically identical to sampling from the target model alone. Lossless.
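You can check the lossless claim exactly on a toy vocabulary. The sketch below assumes the rule from the paper: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalised residual max(p − q, 0). The two distributions are made up for illustration.

```python
def committed_distribution(p: list[float], q: list[float]) -> tuple[list[float], float]:
    """Exact distribution of the token committed at one position, plus the acceptance probability.

    p is the target distribution, q the draft distribution, over the same vocabulary.
    """
    # P(draft proposes x and it is accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accept = [min(px, qx) for px, qx in zip(p, q)]
    accept_prob = sum(accept)
    # On rejection, the replacement is drawn from the normalised residual max(p - q, 0).
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]
    total = sum(residual)
    reject_prob = 1.0 - accept_prob
    out = [
        a + (reject_prob * r / total if total > 0 else 0.0)
        for a, r in zip(accept, residual)
    ]
    return out, accept_prob


p = [0.6, 0.3, 0.1]   # toy target distribution (made up)
q = [0.2, 0.5, 0.3]   # toy draft distribution (made up)
out, accept_prob = committed_distribution(p, q)
print(out, accept_prob)
```

The committed distribution comes out equal to p up to floating-point error, even though q is a poor match. The acceptance probability is the overlap Σ min(p, q), here 0.6: a worse draft costs you speed, not correctness.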
The loop, in one screen
One round consists of:
- Draft. Run a small model M_q autoregressively for k steps, producing draft tokens x₁..x_k with proposal probabilities q(x_i | context).
- Verify. Run the target M_p once over the prompt plus drafts. Get the target distribution p(· | context, x₁..x_{i-1}) at every position.
- Accept-reject. Walk positions left to right. At each i, accept x_i with probability min(1, p(x_i) / q(x_i)). On the first reject, sample a replacement from the residual (p − q)₊.
- Commit. The accepted prefix plus the residual sample become the new state. If all k accept, you get a free bonus token from p(· | x₁..x_k).
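
```python
from random import random

# Pseudocode: draft_model, target_model, softmax, relu, sample, and done are placeholders.
while not done:
    drafts, q = draft_model.sample(prompt, k=K)        # K draft tokens and their distributions
    p_logits = target_model.forward(prompt + drafts)   # one target pass over prompt + drafts
    # p[i] must be the target distribution for draft position i and p[K] the bonus position;
    # if forward() returns logits for every input position, keep only the last K + 1 rows.
    p = softmax(p_logits)
    accepted = 0
    replacement = None
    for i, x in enumerate(drafts):
        if random() < min(1.0, p[i][x] / q[i][x]):
            accepted += 1
        else:
            # first rejection: sample a replacement from the residual at this position
            residual = relu(p[i] - q[i])
            replacement = sample(residual / residual.sum())
            break
    prompt.extend(drafts[:accepted])        # commit the accepted prefix first
    if replacement is not None:
        prompt.append(replacement)          # then the residual sample
    else:                                   # all K drafts accepted
        prompt.append(sample(p[K]))         # bonus token from M_p
```

Resampling from the residual (p − q)₊ on rejection provably preserves the target distribution. You are not "approximating" the big model; you are sampling exactly from it, just with fewer forward passes.

The math: expected tokens per round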
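Let α be the probability that a draft token is accepted at a given position, averaged over positions and treated as independent across them; it measures how closely q tracks p. The number of accepted drafts before the first rejection then follows a geometric distribution truncated at k. Every round also commits one extra token (the residual sample, or the bonus token on full acceptance), so the expectation is:

$$E[\text{tokens}] = \sum_{i=0}^{k} \alpha^{i} = \frac{1 - \alpha^{k+1}}{1 - \alpha}$$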
A few sanity checks: at α = 1 (always accept), the formula evaluates to k + 1. At α = 0 it collapses to 1 (only the residual token, which you'd also have gotten from vanilla decoding). Both match intuition.
| α \ k | 1 | 2 | 3 | 5 | 8 |
|---|---|---|---|---|---|
| 0.5 | 1.50 | 1.75 | 1.88 | 1.97 | 1.99 |
| 0.7 | 1.70 | 2.19 | 2.53 | 2.92 | 3.16 |
| 0.85 | 1.85 | 2.57 | 3.18 | 4.08 | 4.81 |
| 0.95 | 1.95 | 2.85 | 3.71 | 5.31 | 7.16 |
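If you want to evaluate other (α, k) pairs yourself, here is a minimal sketch. It assumes the closed form E[tokens] = (1 − α^(k+1)) / (1 − α) from Leviathan et al., which is the sum of α^i for i = 0..k: one token is always committed, and the i-th draft survives with probability α^i.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected committed tokens per round: sum of alpha**i for i = 0..k."""
    if alpha == 1.0:
        return float(k + 1)          # limit of the closed form as alpha -> 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# The closed form should agree with the explicit sum it comes from.
for alpha, k in [(0.5, 2), (0.7, 3), (0.95, 1)]:
    explicit = sum(alpha ** i for i in range(k + 1))
    assert abs(expected_tokens(alpha, k) - explicit) < 1e-12
    print(alpha, k, round(expected_tokens(alpha, k), 2))
```

The α = 1 branch is there only because the closed form divides by zero at that point; the sum itself is well defined and equals k + 1.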
Speedup, with friction
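Tokens-per-round is only an upper bound on speedup. To get wall-clock speedup you have to pay for the draft. Let c be the draft cost ratio, the time of one draft forward divided by the time of one target forward, and measure time in units of one target forward. The per-round wall-clock and the resulting speedup over vanilla decoding are:

$$T_{\text{round}} = 1 + c\,k, \qquad \text{speedup} = \frac{E[\text{tokens}]}{1 + c\,k} = \frac{1 - \alpha^{k+1}}{(1 - \alpha)(1 + c\,k)}$$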
The intuition: the target costs 1 per round (one big forward over the drafts), the draft costs c per token times k tokens. You ship E[tokens] in that wall-clock window.
With a draft that's ~10× cheaper than the target, c ≈ 0.1. For Llama-3-70B paired with a 1B draft on the same hardware, that ratio is realistic.
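As a worked example, here is the speedup under the simple cost model described above: one unit for the target pass plus c per draft step, with tokens-per-round given by the sum of α^i for i = 0..k. The values α = 0.8 and k = 4 are illustrative, not measurements.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per round divided by the round's cost in target-forward units."""
    tokens = sum(alpha ** i for i in range(k + 1))   # expected committed tokens per round
    round_cost = 1.0 + c * k                          # one target pass plus k draft passes
    return tokens / round_cost


# alpha = 0.8 and k = 4 are illustrative; c = 0.1 is the draft cost ratio from the text.
example = speedup(alpha=0.8, k=4, c=0.1)
print(round(example, 2))
```

Here the round commits about 3.36 tokens and costs 1.4 target-forward units, so the 3.36× token gain shrinks to roughly 2.4× in wall-clock terms. Setting k = 0 recovers vanilla decoding with a speedup of exactly 1.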
The α–k cliff
The optimal k depends on α and c. Push k too high and you spend draft time on tokens you'll throw away. Push it too low and you don't amortise the verifier.
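Because k is a small integer, a brute-force scan is a practical way to find the optimum. The sketch below assumes the simple cost model of one unit per target pass plus c per draft step; the function names and the k_max cap are illustrative. A real deployment has overheads this model ignores, so treat the output as a starting point for measurement.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per round over round cost, in target-forward units."""
    tokens = sum(alpha ** i for i in range(k + 1))
    return tokens / (1.0 + c * k)


def best_k(alpha: float, c: float, k_max: int = 32) -> tuple[int, float]:
    """Scan k = 1..k_max and return the k with the highest modelled speedup."""
    best = max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))
    return best, speedup(alpha, best, c)


k_low, s_low = best_k(alpha=0.5, c=0.1)
k_high, s_high = best_k(alpha=0.9, c=0.1)
print(k_low, k_high)
```

For α = 0.5 and c = 0.1 the scan lands on k = 2. A higher α pushes the optimum up and a more expensive draft pulls it back down.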
Differentiating the speedup formula and solving gives:
| α | best k | speedup |
|---|---|---|
| 0.50 | 2 | 1.32× |
| 0.70 | 3 | 1.74× |
| 0.85 | 5 | 2.50× |
| 0.90 | 6 | 2.99× |
| 0.95 | 8 | 3.93× |
If α < 0.55, almost no choice of k will net a meaningful win: the draft is too misaligned with the target. Don't tune k; train (or pick) a better draft.

Where the draft comes from
"Speculative decoding" has come to mean a whole family of techniques that differ on the answer to one question: where does the draft come from?
| variant | draft source | α (typical) | extra train cost |
|---|---|---|---|
| Vanilla spec (Leviathan) | separate small LM | 0.6–0.85 | low |
| Medusa | extra heads on target | 0.6–0.75 | moderate |
| EAGLE | tiny net on hidden states | 0.8–0.9 | moderate |
| Lookahead decoding | Jacobi iteration on target | spiky | none |
| Tree-attn (SpecInfer, EAGLE-2) | multi-branch draft tree | ≥0.85 eff. | moderate |
Tree attention is the next obvious step once you've squeezed plain speculative decoding: you draft a small tree of candidate continuations and accept along the best path. With shared KV across branches, verification is almost free, and effective α rises well above the single-branch ceiling.
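Verifying a whole tree in one target pass relies on an attention mask in which each draft node sees only itself and its ancestors, so sibling branches never contaminate each other. Here is a minimal sketch of building that mask; the parent-array layout and the function name are assumptions for illustration, and every node would additionally attend to the full prompt.

```python
def tree_attention_mask(parents: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or -1 if the node hangs directly off the prompt.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking the node and every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


# Hypothetical draft tree: two candidate first tokens (0 and 1); node 0 has children 2 and 3.
parents = [-1, -1, 0, 0]
mask = tree_attention_mask(parents)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Nodes 2 and 3 both see node 0 but not each other, which is what lets the two branches share the prefix's KV while being scored independently.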
What it doesn't help
- TTFT. You still pay one full prefill before you can draft anything. See the TTFT post.
- Small targets where decode is already cheap. Spec helps when the bottleneck is "one forward per token" on a large, memory-bound target; if your target is small and decode is already cheap, gains are marginal.
- Highly creative sampling. Higher temperature → lower α → smaller speedup. At T = 1.5, expect 60–70% of the speedup you saw at T = 0.7.
Tuning checklist
- Measure α on a representative eval set, not on the prompts that look easy.
- Set k from the table at your measured α and c.
- Track committed-tokens-per-target-forward as your north-star metric.
- Watch P99, not P50: variance is the real cost of speculative decoding.
- If α drops by more than 5% week over week, the draft has drifted. Refresh it.
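The first and third checklist items can be computed from the same per-round log. This is a minimal sketch, assuming you record how many drafts were accepted in each round and that acceptance is roughly independent across positions; the function name and the sample log are made up.

```python
def round_metrics(accepted_per_round: list[int], k: int) -> tuple[float, float]:
    """Return (estimated alpha, committed tokens per target forward).

    accepted_per_round[r] is how many of the k drafts were accepted in round r.
    """
    # A round that accepts a < k drafts evaluated a + 1 positions (a accepts, 1 reject);
    # a round that accepts all k evaluated exactly k positions.
    evaluated = sum(min(a + 1, k) for a in accepted_per_round)
    accepted = sum(accepted_per_round)
    alpha_hat = accepted / evaluated
    # Every round commits the accepted prefix plus one extra token (residual or bonus).
    rounds = len(accepted_per_round)
    tokens_per_forward = (accepted + rounds) / rounds
    return alpha_hat, tokens_per_forward


alpha_hat, tokens_per_forward = round_metrics([4, 2, 0, 4, 3], k=4)   # made-up log of five rounds
print(alpha_hat, tokens_per_forward)
```

Dividing by evaluated positions rather than by k per round matters: drafts after the first rejection were never tested, so counting them as failures would bias α downward.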
Related: How to estimate LLM time-to-first-token.