node:gpu-0 · cuda 12.4 · vllm-engine

prompt → tokens → response

applied · ai · engineer/part-time researcher

Build the model. Serve the tokens. Cut the TTFT.

stream · stdout · 200 OK

TTFT 0 ms
throughput 1.8k tok/s
model llama-3-8b
decode speculative

TTFT / speculative decoding / paged KV-cache / FlashAttention / vLLM / AWQ / GPTQ / FP8 / CUDA graphs / Triton / continuous batching / EAGLE / Medusa / LoRA / tensor parallel / GQA

// about

abu dhabi · uae · ● available

I build the layer below the model — the runtime, the kernels, the schedulers that let tokens leave the GPU before you finish your sentence.

role

Applied AI engineer working across the inference stack — from model compression and quantization to high-throughput serving on tight latency budgets.

research

Part-time researcher exploring speculative decoding, draft-target alignment, and KV-cache scheduling. Applied first, papers second.

stack

LLM serving · speculative decoding · vLLM / TGI · CUDA · PyTorch · VLMs · edge inference · RAG · evals

// research focus

Two problems I think about most.

Inference is where models meet reality. Latency is product. Throughput is unit economics. Both have to land.

01 / decoding

speedup 2.4×

Speculative Decoding

Drafting tokens with a small model and verifying them in parallel with the target model. Working on draft–target alignment, tree attention, and adaptive draft length to hit 2–3× speedups without quality regression.

Draft-target acceptance rate analysis
Tree-based speculative sampling
Adaptive lookahead under load
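
To make the draft-then-verify idea concrete, here is a minimal sketch of the standard speculative-sampling accept/reject rule. It is an illustration, not the pipeline described on this page: the function names, the list-of-floats distributions, and the `rng` parameter are all assumptions chosen for readability.

```python
import random


def _sample(weights, rng):
    """Draw one index in proportion to the given non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify_draft(draft_tokens, q_probs, p_probs, rng=random):
    """Accept or reject drafted tokens against the target distribution.

    draft_tokens: token ids proposed by the draft model
    q_probs[i]:   draft distribution (list of floats) at position i
    p_probs[i]:   target distribution at position i, plus one extra
                  entry at the end used for the bonus token
    Returns the list of tokens emitted by this verification step.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(_sample(residual, rng))
        return out
    # Every drafted token was accepted: emit one bonus token from the target.
    out.append(_sample(p_probs[len(draft_tokens)], rng))
    return out
```

When the draft and target distributions agree, every drafted token is accepted and the step emits `len(draft_tokens) + 1` tokens. That is why draft-target alignment drives the speedup.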

02 / serving

TTFT P99 84 ms

LLM Inference Serving

Continuous batching, paged attention, and KV-cache eviction at the engine layer. Squeezing throughput out of single nodes before reaching for more GPUs.

Continuous batching + paged KV
Quantization (AWQ, GPTQ, FP8)
TTFT under multi-tenant load
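
As a rough illustration of the paged KV-cache idea, the sketch below keeps a per-sequence block table over a pool of fixed-size blocks. It is a toy model under stated assumptions: the class name, the 16-token default block size, and the `MemoryError` on exhaustion are hypothetical choices, not the API of vLLM or any other engine.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, offset)."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:
            # The last block is full (or the sequence is new): grab a block.
            if not self.free:
                raise MemoryError("no free KV blocks; evict or preempt")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.tables[seq_id][-1], n % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because a sequence only ever holds whole blocks, the memory wasted per sequence is bounded by one partially filled block. The point where the pool runs dry is where an eviction or preemption policy would plug in.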

// skills

Skills mapped to the layers I work in.

Hover a layer below. Each slab is a level of the inference stack — from tokens going in to tokens streaming out.

// experience

Where I've shipped.

2023 — present
Principal Data Scientist · AIQ, a G42 company (Abu Dhabi, UAE)
Lead on LLM inference. Shipped a continuous-batching engine that cut P99 TTFT by 38% under multi-tenant load. Designed a speculative-decoding pipeline that delivered ~2.4× speedup on summarisation workloads.
vLLM · speculative decoding · CUDA · Triton

2021 — 2023
Computer Vision Specialist · AIQ, a G42 company (Abu Dhabi, UAE)
Built and deployed real-time detection models for edge devices. Brought inference latency from 90 ms to 28 ms via pruning, distillation, and INT8 quantization on Jetson-class hardware.
edge · quantization · TensorRT · YOLO

2021 — 2023
ML Research Engineer · University Research Group
Worked on CNN interpretability and feature-map visualization. Wrote tooling that became the default introspection layer for the group's vision projects.
interpretability · PyTorch · visualization

// projects

Things I've built & benchmarked.

All measured. All reproducible. Numbers come from a single A100 unless noted.

proj_01

speedup 2.4×

Speculative Decoding Bench

Benchmark harness for draft-target speculative decoding across draft sizes, acceptance thresholds, and workloads. Includes tree-attention sampling and adaptive lookahead.

LLM · decoding · PyTorch · Triton

code · paper
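
A back-of-envelope model helps when reading a sweep over draft sizes like the one this harness runs. The sketch below is not taken from the benchmark itself. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` times one target forward pass. All three parameter names are illustrative.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target verification step.

    Under independent per-token acceptance with rate alpha, this is
    1 + alpha + ... + alpha**gamma = (1 - alpha**(gamma + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # limit of the geometric sum
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Rough wall-clock speedup over plain autoregressive decoding.

    Each step costs gamma draft passes plus one target pass, measured
    in units of one target forward pass.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

In this model, raising `gamma` has diminishing returns once `alpha ** gamma` is small, while the draft cost keeps growing linearly. That trade-off is the motivation for adapting the draft length.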

proj_02

TTFT P99 84 ms

Tiny-vLLM

Minimal continuous-batching engine I wrote to understand vLLM internals. Paged KV-cache, prefill / decode separation, SSE streaming over a thin HTTP layer.

serving · CUDA · FastAPI

demo · code
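
The core of continuous batching fits in a few lines. The simulation below is a hedged sketch rather than Tiny-vLLM's code: it ignores prefill and KV-cache limits, and the function name and `(req_id, num_tokens)` request shape are assumptions. It shows only the scheduling rule, where free batch slots are refilled at every decode iteration.

```python
from collections import deque


def run_continuous_batching(requests, max_batch):
    """Simulate iteration-level scheduling.

    requests:  list of (req_id, num_tokens_to_generate), in arrival order
    max_batch: maximum number of sequences decoded per iteration
    Returns {req_id: decode step at which the request finished}.
    """
    waiting = deque(requests)
    running = {}   # req_id -> tokens still to generate
    finished = {}
    step = 0
    while waiting or running:
        # Refill free slots every iteration instead of waiting for the
        # whole batch to drain (which is what static batching would do).
        while waiting and len(running) < max_batch:
            req_id, n = waiting.popleft()
            running[req_id] = n
        step += 1
        # One decode iteration: every running sequence emits one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                finished[req_id] = step
                del running[req_id]
    return finished
```

For example, with `max_batch=2` and requests of 3, 1, and 2 tokens, the third request takes the slot freed by the second after a single step instead of waiting for the first to finish.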

proj_03

size −74%