Speculative Decoding Bench
Benchmark harness for draft-target speculative decoding across draft sizes, acceptance thresholds, and workloads. Includes tree-attention sampling and adaptive lookahead.
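For readers new to the technique, here is a minimal sketch of the standard rejection-sampling check used in draft-target speculative decoding. It is an illustration, not the harness's own code: the function name `verify_draft` and the dict-based distributions are assumptions made for the example. The harness's acceptance thresholds may follow a different rule, and the bonus token normally sampled after a fully accepted run is left out for brevity.

```python
import random

def verify_draft(draft_probs, target_probs, draft_tokens, rng):
    """Check drafted tokens against the target model, left to right.

    draft_probs[i] and target_probs[i] map token -> probability at position i
    (hypothetical layout chosen for this sketch). Returns the emitted tokens.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]              # draft probability, > 0 because tok was sampled from it
        p = target_probs[i].get(tok, 0.0)    # target probability of the same token
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)              # accepted: keep the drafted token
            continue
        # Rejected: resample from the normalised residual max(0, p - q),
        # which keeps the output distribution equal to the target's.
        residual = {t: max(0.0, pt - draft_probs[i].get(t, 0.0))
                    for t, pt in target_probs[i].items()}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        emitted.append(rng.choices(tokens, weights=weights, k=1)[0])
        break                                # everything after a rejection is discarded
    return emitted
```

When the draft and target distributions agree, every drafted token is accepted; the more they diverge, the earlier the run is cut short, which is why draft-target alignment drives the achievable speedup.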

Applied AI engineer working across the inference stack, from model compression and quantization to high-throughput serving on tight latency budgets.
Part-time researcher exploring speculative decoding, draft-target alignment, and KV-cache scheduling. Applied first, papers second.
Inference is where models meet reality. Latency is product. Throughput is unit economics. Both have to land.
Lead on LLM inference. Shipped a continuous-batching engine that cut P99 TTFT by 38% under multi-tenant load. Designed a speculative-decoding pipeline that delivered ~2.4× speedup on summarisation workloads.
Built and deployed real-time detection models for edge devices. Cut inference latency from 90 ms to 28 ms via pruning, distillation, and INT8 quantization on Jetson-class hardware.
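As a rough illustration of the INT8 step, the sketch below shows symmetric per-tensor quantization in plain Python. It is a toy under stated assumptions, not the deployed pipeline: the helper names `quantize_int8` and `dequantize_int8` are hypothetical, and a production flow on Jetson-class hardware would normally rely on a calibrated toolchain rather than hand-rolled code.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0   # one scale shared by the whole tensor
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]
```

The round-trip error per value is bounded by roughly half the scale, so tensors with a few large outliers lose the most precision under a single shared scale.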
Worked on CNN interpretability and feature-map visualization. Wrote tooling that became the default introspection layer for the group's vision projects.
All measured. All reproducible. Numbers come from a single A100 unless noted.
Minimal continuous-batching engine I wrote to understand vLLM internals. Paged KV-cache, prefill / decode separation, SSE streaming over a thin HTTP layer.
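The paged KV-cache idea can be sketched as a block allocator: the cache is split into fixed-size blocks, and each sequence keeps a table mapping its logical positions to physical blocks. The class below is a hypothetical illustration (the name `PagedKVAllocator` and its methods are assumptions for this example), not the engine's code and not vLLM's API.

```python
class PagedKVAllocator:
    """Toy block allocator for a paged KV-cache."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical block ids not in use
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:                 # current block is full, or first token
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are allocated on demand and returned as soon as a sequence finishes, a scheduler can admit new requests between decode steps without reserving memory for each sequence's maximum length up front.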
Compressed a vision-language model to fit a 6 GB Jetson budget while preserving 96% downstream accuracy via AWQ + LoRA fine-tuning.
Open to applied-research collabs, consulting on LLM inference, and full-time roles working on the runtime layer.