Suraj Sharan · Applied AI Engineer

Vanilla autoregressive decoding generates one token per forward pass. That makes the inter-token latency (ITL) of a 70B model a straightforward function of HBM bandwidth: every step you load ~140 GB of weights to produce a single token. Speculative decoding breaks that 1:1 by spending cheap compute to verify many tokens in one expensive pass.

This post is a companion to the visual walk-through on /learn, with the math behind the bar chart.

Why it works

A transformer forward pass over a sequence of length n+k already gives you the next-token distribution at every position — including positions n through n+k-1. Vanilla decoding throws those away. Speculative decoding uses them as cheap verifications for a draft sequence.

The brilliant part — from Leviathan, Kalman & Matias, 2023 — is that you can do this without changing the output distribution. The accept/reject step uses rejection sampling so the final stream is statistically identical to sampling from the target model alone. Lossless.

The loop, in one screen

One round consists of:

Draft. Run a small model M_q autoregressively for k steps, producing draft tokens x₁..x_k with proposal probabilities q(x_i | context).
Verify. Run the target M_p once over the prompt plus drafts. Get the target distribution p(· | context, x₁..x_{i-1}) at every position.
Accept-reject. Walk positions left to right. At each i, accept x_i with probability min(1, p(x_i) / q(x_i)). On the first reject, sample a replacement from the normalised residual (p − q)₊ / Σ(p − q)₊ and discard the remaining drafts.
Commit. The accepted prefix plus the residual sample become the new state. If all k accept, you get a free bonus token from p(· | x₁..x_k).

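spec_decode.py (Python)

import random
import numpy as np

# draft_model, target_model, softmax and sample are placeholders for your own code.
# Assumes forward() returns one row of logits per input position.
while not done:
    n = len(prompt)
    drafts, q = draft_model.sample(prompt, k=K)      # q[i]: draft distribution for drafts[i]
    logits = target_model.forward(prompt + drafts)   # logits[t] predicts the token at t + 1
    p = softmax(logits[n - 1:])                      # p[i] scores drafts[i]; p[K] is the bonus slot

    accepted, replacement = 0, None
    for i, x in enumerate(drafts):
        if random.random() < min(1.0, p[i][x] / q[i][x]):
            accepted += 1
        else:
            # first rejection: resample from the normalised residual (p - q)+
            residual = np.maximum(p[i] - q[i], 0.0)
            replacement = sample(residual / residual.sum())
            break

    prompt.extend(drafts[:accepted])          # commit the accepted prefix first
    if replacement is None:                   # all K drafts accepted
        prompt.append(sample(p[K]))           # bonus token from M_p
    else:
        prompt.append(replacement)            # then the token replacing the rejected draft
    # set done here on EOS or when the length budget is reached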

lossless, properly

The combined process — accept by ratio, fall back to (p−q)₊ — provably preserves the target distribution. You are not "approximating" the big model; you are sampling exactly from it, just with fewer forward passes.
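The claim is easy to check empirically at a single position. The sketch below is a toy under stated assumptions: the four-token vocabulary and the two distributions `p` and `q` are made up for illustration, and `speculative_step` is a hypothetical helper name, not part of any library. It draws a token from `q`, applies the accept/reject rule, and compares the empirical output frequencies with `p`.

```python
import random

def speculative_step(p, q, rng):
    """One accept/reject step at a single position; returns the committed token id."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]          # the draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with probability min(1, p/q)
        return x
    # rejected: resample from the residual (p - q)+ ; choices() normalises the weights
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]

p = [0.50, 0.30, 0.15, 0.05]   # toy target distribution (assumed)
q = [0.25, 0.25, 0.25, 0.25]   # toy draft distribution (assumed)

rng = random.Random(0)
n = 100_000
counts = [0] * len(p)
for _ in range(n):
    counts[speculative_step(p, q, rng)] += 1

empirical = [c / n for c in counts]
print([round(e, 3) for e in empirical])   # should land close to p, not q
```

The reason it works: a token x is emitted with probability min(p(x), q(x)) via acceptance plus (p(x) − q(x))₊ via the residual, and those two terms sum to p(x).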

The math: expected tokens per round

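Let α be the average per-position acceptance probability, which is high when q is close to p. If acceptances are independent with the same α at every position, the number of accepted tokens before the first rejection follows a geometric distribution truncated at k. Counting the extra token every round commits (the residual sample on a reject, the bonus token on full acceptance), the expectation is: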

E[tokens per round] = (1 − α^(k+1)) / (1 − α)

(Leviathan et al. 2023)

A few sanity checks: as α → 1 (always accept), the formula tends to k + 1; at exactly α = 1 it reads 0/0, so use the equivalent sum 1 + α + … + α^k. At α = 0 it collapses to 1 (only the residual token, which you'd also have gotten from vanilla decoding). Both match intuition.
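For your own α and k, the formula is a one-liner. The sketch below uses the equivalent sum form so that α = 1 needs no special case, and cross-checks it with a small Monte Carlo simulation. Both function names are illustrative, and the simulation assumes each draft is accepted independently with the same probability α, which is the same idealisation the formula makes.

```python
import random

def expected_tokens(alpha: float, k: int) -> float:
    """E[tokens per round] = 1 + alpha + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha)."""
    return sum(alpha ** i for i in range(k + 1))

def simulate_tokens(alpha: float, k: int, rounds: int = 100_000, seed: int = 0) -> float:
    """Average committed tokens per round under i.i.d. acceptance with probability alpha."""
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        accepted = 0
        while accepted < k and rng.random() < alpha:
            accepted += 1
        total += accepted + 1   # +1: residual sample on a reject, bonus token on full accept
    return total / rounds

for alpha in (0.5, 0.7, 0.85, 0.95):
    row = "  ".join(f"{expected_tokens(alpha, k):.2f}" for k in (1, 2, 3, 5, 8))
    print(f"alpha={alpha:.2f}  {row}")
```

If your measured acceptance rate varies a lot by position, the simulation is the easier of the two to adapt: replace the constant `alpha` with a per-position value.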

α \ k	1	2	3	5	8
0.5	1.50	1.75	1.88	1.97	2.00
0.7	1.70	2.19	2.53	2.94	3.20
0.85	1.85	2.57	3.19	4.15	5.12
0.95	1.95	2.85	3.71	5.30	7.40

E[tokens/round] for typical α and k

Speedup, with friction

Tokens-per-round is only an upper bound on speedup. To get wall-clock speedup you have to pay for the draft. Let c be the draft cost ratio, i.e. the time of one draft forward divided by the time of one target forward. Measuring a round in units of one target forward, the speedup over vanilla decoding is:

speedup ≈ E[tokens] / (1 + c · k)

(speedup approximation)

The intuition: the target costs 1 per round (one big forward over the drafts), the draft costs c per token times k tokens. You ship E[tokens] in that wall-clock window.

With a draft that's ~10× cheaper than the target, c ≈ 0.1. For Llama-3-70B paired with a 1B draft on the same hardware, that ratio is realistic.

α = 0.8, k = 5, c = 0.1 ⟹ E[tokens] = (1 − 0.8⁶) / 0.2 ≈ 3.69, speedup ≈ 3.69 / 1.5 ≈ 2.46×
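To see how sensitive the result is to the draft cost, it helps to sweep c with everything else held fixed. A minimal sketch, assuming the simple cost model above (one target forward plus k draft forwards per round, nothing else); the function names are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """1 + alpha + ... + alpha^k, the expected committed tokens per round."""
    return sum(alpha ** i for i in range(k + 1))

def speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per round divided by round cost in target-forward units."""
    return expected_tokens(alpha, k) / (1.0 + c * k)

# Sweep the draft cost ratio at fixed alpha and k.
for c in (0.05, 0.10, 0.20):
    print(f"c={c:.2f}  speedup={speedup(0.8, 5, c):.2f}x")
```

The model ignores sampling overhead, KV-cache rollback and kernel launch latency, so treat the output as an optimistic estimate rather than a prediction.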

The α–k cliff

The optimal k depends on α and c. Push k too high and you spend draft time on tokens you'll throw away. Push it too low and you don't amortise the verifier.

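The maximiser has no tidy closed form, and k is a small integer anyway, so in practice you evaluate the speedup formula over a range of k and keep the best: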

k^* = argmax_k (1 − α^(k+1)) / [(1 − α)(1 + c · k)]

α	best k	speedup
0.50	2	1.35×
0.70	3	1.75×
0.85	6	2.38×
0.90	8	2.78×
0.95	12	3.48×

optimal k at c = 0.15
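Because k only takes small integer values, the argmax can be found by brute force. The sketch below scans k from 1 to a cap; `best_k` and the cap `k_max=32` are illustrative choices, and the cost model is the same simplified one used for the speedup formula.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """1 + alpha + ... + alpha^k, the expected committed tokens per round."""
    return sum(alpha ** i for i in range(k + 1))

def speedup(alpha: float, k: int, c: float) -> float:
    """Expected tokens per round divided by round cost in target-forward units."""
    return expected_tokens(alpha, k) / (1.0 + c * k)

def best_k(alpha: float, c: float, k_max: int = 32) -> int:
    """Integer k in [1, k_max] that maximises the modelled speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))

c = 0.15
for alpha in (0.50, 0.70, 0.85, 0.90, 0.95):
    k = best_k(alpha, c)
    print(f"alpha={alpha:.2f}  best k={k}  speedup={speedup(alpha, k, c):.2f}x")
```

The curve is flat near its peak, so a k one or two below the argmax usually costs very little speedup and wastes less draft work on rejected tokens.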

watch out

If your α < 0.55, almost no choice of k will net a meaningful win — the draft is too misaligned with the target. Don't tune k; train (or pick) a better draft.

Where the draft comes from

"Speculative decoding" has come to mean a whole family of techniques that differ on the answer to one question: where does the draft come from?

variant	draft source	α (typical)	extra train cost
Vanilla spec (Leviathan)	separate small LM	0.6–0.85	low
Medusa	extra heads on target	0.6–0.75	moderate
EAGLE	tiny net on hidden states	0.8–0.9	moderate
Lookahead decoding	Jacobi iteration on target	spiky	none
Tree-attn (SpecInfer, EAGLE-2)	multi-branch draft tree	≥0.85 eff.	moderate

comparison · all numbers indicative

Tree attention is the next obvious step once you've squeezed everything out of plain speculative decoding: you draft a small tree of candidate continuations and accept along the best path. With shared KV across branches, verification is almost free, and the effective α rises well above the single-branch ceiling.

What it doesn't help

TTFT. You still pay one full prefill before you can draft anything. See the TTFT post.
Small targets and compute-bound decode. Spec helps when the bottleneck is "one forward per token", i.e. memory-bound decode where loading weights dominates. If your target is small and decode is already cheap, or large batches already saturate compute, gains are marginal.
Highly creative sampling. Higher temperature → lower α → smaller speedup. At T = 1.5, expect 60–70% of the speedup you saw at T = 0.7.

Tuning checklist

getting speculative right

Measure α on a representative eval set, not on the prompts that look easy.
Set k by maximising the speedup formula at your measured α and c; the optimal-k table covers c = 0.15 only.
Track committed-tokens-per-target-forward as your north-star metric.
Watch P99, not P50 — variance is the real cost of speculative.
If α drops > 5% week over week, the draft has drifted. Refresh it.
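The first and third items can both be computed from one log: the number of drafts accepted in each round. A minimal sketch, assuming you record that count per round and that every round uses exactly one target forward and the same k; `summarize_rounds` and the sample log are hypothetical.

```python
def summarize_rounds(accepted_per_round, k):
    """Return (alpha_hat, committed tokens per target forward) from per-round accept counts."""
    rounds = len(accepted_per_round)
    accepted = sum(accepted_per_round)
    # A round tests drafts until the first rejection:
    # a + 1 acceptance trials if it rejected, k trials if all k drafts were accepted.
    trials = sum(min(a + 1, k) for a in accepted_per_round)
    alpha_hat = accepted / trials
    # Every round commits its accepted drafts plus one extra token (residual or bonus).
    tokens_per_forward = (accepted + rounds) / rounds
    return alpha_hat, tokens_per_forward

log = [5, 2, 0, 5, 3, 1]            # made-up accept counts for six rounds at k = 5
alpha_hat, tpf = summarize_rounds(log, k=5)
print(f"alpha_hat={alpha_hat:.2f}  tokens/forward={tpf:.2f}")
```

Counting only the drafts that were actually tested matters: dividing accepted tokens by k × rounds would count untested drafts after a rejection as failures and bias the estimate of α downward.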

Speculative decoding — the math, and when it breaks.