// notes · decoding · math

Speculative decoding — the math, and when it breaks.

A small model proposes k tokens, the big model verifies them in one parallel pass. The trick is rejection sampling — but the trick has a cliff, and most production deployments are sitting on the wrong side of it.

by suraj sharan · 2026 · 05 · 9 min read

Vanilla autoregressive decoding generates one token per forward pass. That makes the inter-token latency (ITL) of a 70B model a straightforward function of HBM bandwidth: every step you load ~140 GB of weights (70B parameters at 2 bytes each) to produce a single token. Speculative decoding breaks that 1:1 by spending cheap draft compute so that one expensive pass verifies many tokens.
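
As a rough sketch of that arithmetic, the snippet below divides weight bytes by memory bandwidth to get a lower bound on ITL. The function name and the 2 TB/s aggregate bandwidth are illustrative assumptions, not measurements; plug in your own hardware numbers.

```python
def itl_floor_seconds(n_params: float, bytes_per_param: float,
                      hbm_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on inter-token latency when every step streams all weights once."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / hbm_bandwidth_bytes_per_s


# 70B parameters at 2 bytes each is ~140 GB of weights.
# 2e12 B/s is a hypothetical aggregate bandwidth, used only for illustration.
itl = itl_floor_seconds(70e9, 2, 2e12)
print(f"ITL floor: {itl * 1e3:.0f} ms per token")
```

This ignores KV-cache reads, activations and kernel overheads, so real ITL sits above the floor.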

This post is a companion to the visual walk-through on /learn, with the math behind the bar chart.

Why it works

A transformer forward pass over a sequence of length n+k already gives you the next-token distribution at every position — including positions n through n+k-1. Vanilla decoding throws those away. Speculative decoding uses them as cheap verifications for a draft sequence.

The brilliant part — from Leviathan, Kalman & Matias, 2023 — is that you can do this without changing the output distribution. The accept/reject step uses rejection sampling so the final stream is statistically identical to sampling from the target model alone. Lossless.

The loop, in one screen

One round consists of:

  1. Draft. Run a small model M_q autoregressively for k steps, producing draft tokens x₁..x_k with proposal probabilities q(x_i | context).
  2. Verify. Run the target M_p once over the prompt plus drafts. Get the target distribution p(· | context, x₁..x_{i-1}) at every position.
  3. Accept-reject. Walk positions left to right. At each i, accept x_i with probability min(1, p(x_i) / q(x_i)). On the first reject, sample a replacement from the normalised residual (p − q)₊ and stop.
  4. Commit. The accepted prefix plus the residual sample become the new state. If all k accept, you get a free bonus token from p(· | x₁..x_k).
spec_decode.py · python
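import numpy as np
from random import random

while not done:
    # drafts: K proposed tokens; q[i]: draft distribution at draft position i
    drafts, q = draft_model.sample(prompt, k=K)
    # one target pass; p[i] is the target distribution at draft position i,
    # and p[K] is the distribution after all K drafts
    p_logits  = target_model.forward(prompt + drafts)
    p         = softmax(p_logits)

    accepted, replacement = 0, None
    for i, x in enumerate(drafts):
        if random() < min(1.0, p[i][x] / q[i][x]):
            accepted += 1
        else:
            # first reject: sample from the normalised residual (p - q)+
            residual = np.maximum(p[i] - q[i], 0.0)
            replacement = sample(residual / residual.sum())
            break

    prompt.extend(drafts[:accepted])     # accepted prefix goes in first
    if replacement is not None:
        prompt.append(replacement)       # then the residual sample
    else:                                # all K drafts accepted
        prompt.append(sample(p[K]))      # bonus token from M_p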
lossless, properly
The combined process — accept by ratio, fall back to (p−q)₊ — provably preserves the target distribution. You are not "approximating" the big model; you are sampling exactly from it, just with fewer forward passes.
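
One way to convince yourself of this is to compute the committed-token distribution exactly for a single position. The sketch below uses two made-up toy distributions; the function names are illustrative and not taken from any library.

```python
import numpy as np


def acceptance_probability(p, q):
    """Chance that a token drawn from q is accepted: sum_x q(x) * min(1, p(x)/q(x))."""
    return float(np.minimum(p, q).sum())


def committed_distribution(p, q):
    """Exact distribution of the token committed at one position."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    accept_mass = np.minimum(p, q)        # P(draft is x AND it is accepted)
    residual = np.maximum(p - q, 0.0)     # (p - q)+, not yet normalised
    total = residual.sum()
    if total == 0.0:                      # p == q: nothing is ever rejected
        return accept_mass
    reject_prob = 1.0 - accept_mass.sum()
    return accept_mass + reject_prob * residual / total


p = np.array([0.5, 0.3, 0.2])   # toy target distribution (made up)
q = np.array([0.2, 0.2, 0.6])   # toy draft distribution (made up)
out = committed_distribution(p, q)
print(out, acceptance_probability(p, q))
assert np.allclose(out, p)      # the committed token is distributed exactly as p
```

The same identity holds for any p and q, because min(p, q) + (p − q)₊ = p elementwise and the rejection probability equals the total residual mass.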

The math: expected tokens per round

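Let α be the average per-position acceptance probability, basically how often q ≈ p. Treating each position as accepting independently with probability α, the number of accepted drafts before the first rejection is geometric, truncated at k. Every round also commits one extra token (the residual sample on a reject, the bonus token on full acceptance), so the expectation is: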

E[tokens per round] = (1 − α^(k+1)) / (1 − α)
(Leviathan et al. 2023)
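
For readers who want the intermediate step, here is one way to get there, assuming each position accepts independently with probability α. Let N be the number of tokens committed in a round (the symbol N is introduced here for illustration). A round commits at least j + 1 tokens exactly when the first j drafts are all accepted, and summing those tail probabilities gives the expectation:

```latex
\begin{aligned}
\Pr[N \ge j + 1] &= \alpha^{j}, \qquad j = 0, 1, \dots, k \\
\mathbb{E}[N] &= \sum_{j=0}^{k} \Pr[N \ge j + 1]
              = \sum_{j=0}^{k} \alpha^{j}
              = \frac{1 - \alpha^{k+1}}{1 - \alpha}
\end{aligned}
```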

A few sanity checks: as α → 1 (always accept), the formula tends to k + 1. At α = 0 it collapses to 1 (only the residual token, which you'd also have gotten from vanilla decoding). Both match intuition.

  α \ k   1      2      3      5      8
  0.5     1.50   1.75   1.88   1.97   2.00
  0.7     1.70   2.19   2.53   2.94   3.20
  0.85    1.85   2.57   3.19   4.15   5.12
  0.95    1.95   2.85   3.71   5.30   7.40
E[tokens/round] for typical α and k
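
If you want to evaluate the formula on your own grid, a minimal sketch follows. It uses the equivalent sum 1 + α + … + α^k so that α = 1 needs no special case; the function name is illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected committed tokens per round: sum of alpha**j for j = 0..k."""
    return sum(alpha ** j for j in range(k + 1))


alphas = [0.5, 0.7, 0.85, 0.95]
ks = [1, 2, 3, 5, 8]
table = {a: [expected_tokens(a, k) for k in ks] for a in alphas}

print("alpha  " + "  ".join(f"k={k:<4}" for k in ks))
for a, row in table.items():
    print(f"{a:<5}  " + "  ".join(f"{v:<6.2f}" for v in row))
```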

Speedup, with friction

Tokens-per-round is the upper bound. To get wall-clock speedup you have to pay for the draft. Let c be the draft cost ratio, the time of one draft forward divided by one target forward. Measured in target forwards, a round costs 1 + c · k of wall-clock, so:

speedup ≈ E[tokens] / (1 + c · k)
(speedup approximation)

The intuition: the target costs 1 per round (one big forward over the drafts), the draft costs c per token times k tokens. You ship E[tokens] in that wall-clock window.

With a draft that's ~10× cheaper than the target, c ≈ 0.1. For Llama-3-70B paired with a 1B draft on the same hardware, that ratio is realistic.

α = 0.8, k = 5, c = 0.1 ⟹ E[tokens] ≈ 3.69, speedup ≈ 2.46×
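
A small sketch of the same calculation, under the simplified cost model described here (one target forward per round plus k draft forwards at relative cost c). The function names are illustrative.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected committed tokens per round, bonus token included."""
    return sum(alpha ** j for j in range(k + 1))


def speedup(alpha: float, k: int, c: float) -> float:
    """Approximate speedup: tokens per round over round cost in target forwards."""
    return expected_tokens(alpha, k) / (1.0 + c * k)


e = expected_tokens(0.8, 5)
s = speedup(0.8, 5, 0.1)
print(f"E[tokens] = {e:.2f}, speedup = {s:.2f}x")
```

The model leaves out scheduling overhead and the slightly higher cost of a target pass over k + 1 positions, so treat the result as an optimistic estimate.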

The α–k cliff

The optimal k depends on α and c. Push k too high and you spend draft time on tokens you'll throw away. Push it too low and you don't amortise the verifier.

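Maximising the speedup formula over k gives the optimum. There is no tidy closed form, but k is a small integer, so a direct scan over candidate values is enough:

k* = argmax_k   (1 − α^(k+1)) / [(1 − α)(1 + c · k)]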
  α      best k   speedup
  0.50   2        1.35×
  0.70   3        1.75×
  0.85   6        2.38×
  0.90   8        2.78×
  0.95   12       3.48×
optimal k at c = 0.15
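
A brute-force scan is enough to find the optimum for your own α and c. The sketch below assumes the speedup approximation from the previous section; the search cap `k_max=32` is an arbitrary illustrative choice.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Speedup approximation: E[tokens per round] / (1 + c * k)."""
    expected = sum(alpha ** j for j in range(k + 1))
    return expected / (1.0 + c * k)


def best_k(alpha: float, c: float, k_max: int = 32) -> int:
    """Integer k in 1..k_max that maximises the approximate speedup."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))


for alpha in (0.50, 0.70, 0.85, 0.90, 0.95):
    k = best_k(alpha, c=0.15)
    print(f"alpha={alpha:.2f}  best k={k:<2}  speedup={speedup(alpha, k, 0.15):.2f}x")
```

The curve is flat near its peak, so neighbouring k values lose only a percent or two. That leaves room to pick a smaller k when tail latency matters.
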
watch out
If your α < 0.55, almost no choice of k will net a meaningful win: at c = 0.15 the speedup formula tops out around 1.4× (at k = 2). The draft is too misaligned with the target. Don't tune k; train (or pick) a better draft.

Where the draft comes from

"Speculative decoding" has come to mean a whole family of techniques that differ on the answer to one question: where does the draft come from?

  variant                          draft source                 α (typical)   extra train cost
  Vanilla spec (Leviathan)         separate small LM            0.6–0.85      low
  Medusa                           extra heads on target        0.6–0.75      moderate
  EAGLE                            tiny net on hidden states    0.8–0.9       moderate
  Lookahead decoding               Jacobi iteration on target   spiky         none
  Tree-attn (SpecInfer, EAGLE-2)   multi-branch draft tree      ≥0.85 eff.    moderate
comparison · all numbers indicative

Tree attention is the next obvious step once you've squeezed plain speculative — you draft a small tree of candidate continuations and accept along the best path. With shared KV across branches, this is almost free verification-side, and pushes effective α well above the single-branch ceiling.
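
To make the verification side concrete, here is a minimal sketch of the attention mask over a draft tree: each draft node attends to itself and its ancestors, so every root-to-leaf path is scored as if it were its own sequence. The `parents` encoding and function name are illustrative assumptions, not the API of SpecInfer or EAGLE-2, and attention to the already-committed prefix is left out.

```python
import numpy as np


def tree_attention_mask(parents):
    """Boolean mask where mask[i, j] is True if draft node i may attend to node j.

    parents[i] is the index of node i's parent, or -1 for a node hanging
    directly off the committed prefix. Parents must appear before children.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i, parent in enumerate(parents):
        mask[i, i] = True                 # every node sees itself
        if parent >= 0:
            mask[i] |= mask[parent]       # and everything its parent sees
    return mask


# Toy tree: node 0 has children 1 and 2; node 3 follows 1; node 4 follows 2.
parents = [-1, 0, 0, 1, 2]
mask = tree_attention_mask(parents)
print(mask.astype(int))
```

Sibling branches never see each other, which is what lets one target pass score several candidate continuations at once.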

What it doesn't help

  • TTFT. You still pay one full prefill before you can draft anything. See the TTFT post.
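  • Compute-bound decode, such as large batches. Spec helps when decode is memory-bound and the bottleneck is "one forward per token", because verifying k extra positions rides along with a weight load you were paying for anyway. Once the target pass is compute-bound that verification is no longer close to free, and if your target is small and decode is already cheap, gains are marginal.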
  • Highly creative sampling. Higher temperature → lower α → smaller speedup. At T = 1.5, expect 60–70% of the speedup you saw at T = 0.7.

Tuning checklist

getting speculative right
  1. Measure α on a representative eval set, not on the prompts that look easy.
  2. Set k from the table at your measured α; if your c is not 0.15, recompute the argmax at your own c.
  3. Track committed-tokens-per-target-forward as your north-star metric.
  4. Watch P99, not P50 — variance is the real cost of speculative.
  5. If α drops > 5% week over week, the draft has drifted. Refresh it.
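
For items 1 and 3, a minimal sketch of how the two numbers might be computed from per-round logs. It assumes you record how many drafts were accepted in each round and that the log is non-empty; the function name, dictionary keys and sample log are made up for illustration.

```python
def round_metrics(accepted_per_round, k):
    """Estimate alpha and committed tokens per target forward from round logs.

    accepted_per_round[r] is the number of drafts accepted in round r (0..k).
    """
    rounds = len(accepted_per_round)
    accepted = sum(accepted_per_round)
    # A round checks drafts until the first reject, so it evaluates
    # a + 1 positions when a < k and k positions when all k are accepted.
    evaluated = sum(min(a + 1, k) for a in accepted_per_round)
    committed = accepted + rounds   # +1 per round: residual or bonus token
    return {
        "alpha_hat": accepted / evaluated,
        "tokens_per_target_forward": committed / rounds,
    }


log = [5, 2, 0, 5, 3, 1]            # made-up acceptance counts for k = 5
metrics = round_metrics(log, k=5)
print(metrics)
```

Dividing by evaluated positions rather than by k per round matters: drafts after the first reject are never checked, so counting them would bias the estimate of α downward.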

Related: How to estimate LLM time-to-first-token.
