node:gpu-0 · cuda 12.4 · vllm-engine
applied ai engineer / part-time researcher

Build the model. Serve the tokens. Cut the TTFT.

stream · stdout · 200 OK
TTFT: 0 ms
throughput: 1.8k tok/s
model: llama-3-8b
decode: speculative
TTFT / speculative decoding / paged KV-cache / FlashAttention / vLLM / AWQ / GPTQ / FP8 / CUDA graphs / Triton / continuous batching / EAGLE / Medusa / LoRA / tensor parallel / GQA
// about
Suraj Sharan
abu dhabi · uae · ● available

I build the layer below the model — the runtime, the kernels, the schedulers that let tokens leave the GPU before you finish your sentence.

role

Applied AI engineer working across the inference stack — from model compression and quantization to high-throughput serving on tight latency budgets.

research

Part-time researcher exploring speculative decoding, draft-target alignment, and KV-cache scheduling. Applied first, papers second.

stack
LLM serving · speculative decoding · vLLM / TGI · CUDA · PyTorch · VLMs · edge inference · RAG · evals
// skills

Skills mapped to the layers I work in.

Hover a layer below. Each slab is a level of the inference stack — from tokens going in to tokens streaming out.

// experience

Where I've shipped.

  • 2023 — present

    Principal Data Scientist · AIQ, a G42 company (Abu Dhabi, UAE)

    Lead on LLM inference. Shipped a continuous-batching engine that cut P99 TTFT by 38% under multi-tenant load. Designed a speculative-decoding pipeline that delivered ~2.4× speedup on summarisation workloads.

    vLLM · speculative decoding · CUDA · Triton
  • 2021 — 2023

    Computer Vision Specialist · AIQ, a G42 company (Abu Dhabi, UAE)

    Built and deployed real-time detection models for edge devices. Cut inference latency from 90 ms to 28 ms via pruning, distillation, and INT8 quantization on Jetson-class hardware.

    edge · quantization · TensorRT · YOLO
  • 2021 — 2023

    ML Research Engineer · University Research Group

    Worked on CNN interpretability and feature-map visualization. Wrote tooling that became the default introspection layer for the group's vision projects.

    interpretability · PyTorch · visualization
// projects

Things I've built & benchmarked.

All measured. All reproducible. Numbers come from a single A100 unless noted.

proj_01
speedup 2.4×

Speculative Decoding Bench

Benchmark harness for draft-target speculative decoding across draft sizes, acceptance thresholds, and workloads. Includes tree-attention sampling and adaptive lookahead.

LLM · decoding · PyTorch · Triton
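
As a rough illustration of what a harness like this sweeps, the sketch below uses the standard analytic model for speculative decoding: if each drafted token is accepted independently with probability alpha and the draft proposes gamma tokens per step, the target model emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verification pass on average. The function names and the cost_ratio parameter (cost of one draft pass relative to one target pass) are illustrative assumptions, not part of the benchmark itself, and real acceptance is not independent across positions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    alpha: assumed per-token acceptance probability (treated as i.i.d.).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain autoregressive decoding under the same model.

    cost_ratio: hypothetical cost of one draft pass relative to one target pass.
    One step costs gamma draft passes plus one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_draft_length(alpha: float, cost_ratio: float, max_gamma: int) -> int:
    """Draft length in 1..max_gamma that maximises the modelled speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_speedup(alpha, g, cost_ratio),
    )
```

Sweeping gamma for a fixed alpha shows why longer drafts stop paying off: past some length, the added draft cost outweighs the shrinking chance that the extra tokens are accepted.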
proj_02
TTFT P99 84 ms

Tiny-vLLM

Minimal continuous-batching engine I wrote to understand vLLM internals. Paged KV-cache, prefill / decode separation, SSE streaming over a thin HTTP layer.

serving · CUDA · FastAPI
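
The idea behind a paged KV-cache can be sketched in a few lines: the cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory is allocated one block at a time instead of reserved up front for the maximum length. This is a toy allocator written for illustration only; the class and method names are hypothetical and are neither Tiny-vLLM's nor vLLM's actual API.

```python
class PagedKVAllocator:
    """Toy block allocator for a paged KV-cache (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Free list of physical block ids; pop() hands out block 0 first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))
        # Per-sequence block table: logical block index -> physical block id.
        self.block_tables: dict[int, list[int]] = {}
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        offset = length % self.block_size
        if offset == 0:
            # Current block is full (or the sequence is new): grab a fresh one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller should queue or preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], offset

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(reversed(self.block_tables.pop(seq_id, [])))
        self.seq_lens.pop(seq_id, None)
```

A scheduler built on this would call append_token once per generated token and free when a request completes, which is what lets a continuous-batching loop admit new requests as soon as blocks are released.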
proj_03
size −74%

Edge-VLM Compression

Compressed a vision-language model to fit a 6 GB Jetson budget while preserving 96% downstream accuracy via AWQ + LoRA fine-tuning.

VLM · quantization · edge · LoRA
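
For a sense of scale, the back-of-envelope sketch below estimates the size reduction from group-wise weight quantization. It assumes an FP16 baseline, 4-bit weights, a group size of 128, and one FP16 scale plus one 4-bit zero point per group; these are common AWQ-style settings, but they are assumptions here, since the project's actual configuration is not stated. It also ignores layers left unquantized and any LoRA adapter weights.

```python
def quantized_bits_per_weight(
    weight_bits: int,
    group_size: int,
    scale_bits: int = 16,
    zero_bits: int = 4,
) -> float:
    """Average storage per weight, including per-group scale and zero point."""
    return weight_bits + (scale_bits + zero_bits) / group_size


def size_reduction(
    baseline_bits: int,
    weight_bits: int,
    group_size: int,
    scale_bits: int = 16,
    zero_bits: int = 4,
) -> float:
    """Fractional size reduction relative to a dense baseline format."""
    quantized = quantized_bits_per_weight(
        weight_bits, group_size, scale_bits, zero_bits
    )
    return 1.0 - quantized / baseline_bits


# Assumed settings: FP16 baseline, 4-bit weights, group size 128.
reduction = size_reduction(baseline_bits=16, weight_bits=4, group_size=128)
```

Under those assumed settings the reduction comes out at roughly three quarters, in the same range as the figure quoted for this project, though a real model's number depends on which layers are quantized.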
// contact

Got a hard inference problem? Let's talk.

Open to applied-research collabs, consulting on LLM inference, and full-time roles working on the runtime layer.

Abu Dhabi · UAE · UTC+4