Who this is for: Whether you're building your own inference engine, evaluating vLLM vs SGLang for production, or simply want to understand how modern LLM serving works under the hood — this guide walks through every layer from first principles. We go deep on GPU architecture and systems design, but every novel concept (paged attention, RadixAttention, continuous batching, speculative decoding, disaggregated serving) is explained from scratch so you can follow along regardless of background.
How to read: Each of the 13 sections covers one layer of the serving stack — starting with the problem it solves, then showing how vLLM and SGLang each tackle it, with code references and comparison tables. Quizzes at the end of each section let you check your understanding. Feel free to jump to any section that interests you.
Codebase: vLLM at bd8bd52 (2026-04-11), SGLang at 78043d4 (2026-04-11).
1. Front Door / API Layer
First Principles
An inference engine is, at its core, a GPU-bound pipeline: tokenize → schedule → compute → sample → detokenize. The front door's job is to accept HTTP requests, convert them into the engine's internal representation, and stream results back — all without becoming the bottleneck.
The critical insight is that the GPU is the scarce resource, not the CPU. A single H100 costs ~$30K and can process thousands of tokens per second, but only if it's kept busy. Every microsecond the GPU idles because the API layer is slow to enqueue the next request is wasted money. This creates a fundamental design tension:
Tokenization is CPU-bound and variable-latency. A 100K-token prompt takes orders of magnitude longer to tokenize than a 10-token prompt. If tokenization blocks the event loop, GPU scheduling stalls.
Streaming requires per-token delivery. Users expect Server-Sent Events (SSE) with sub-100ms inter-token latency. The API layer must multiplex thousands of concurrent SSE streams without head-of-line blocking.
Serialization overhead matters at scale. At 10K requests/second, the cost of JSON serialization, request validation, and inter-process marshalling becomes non-trivial. Binary protocols (msgspec, protobuf) outperform JSON by 5-10x for structured data.
Process isolation prevents GIL contention. Python's GIL means that tokenization, scheduling, and HTTP handling cannot truly parallelize within a single process. The solution is multi-process architecture with efficient IPC.
Design Space
An implementer faces these choices:
| Decision | Options | Trade-off |
|---|---|---|
| Web framework | FastAPI, aiohttp, Starlette, raw asyncio | FastAPI gives OpenAPI docs + validation but adds overhead |
| Event loop | uvloop, default asyncio | uvloop is 2-4x faster for I/O-bound workloads |
| IPC mechanism | ZMQ, gRPC, Unix pipes, shared memory | ZMQ: low-latency, no schema; gRPC: typed, higher overhead |
| Serialization | msgspec, protobuf, pickle, JSON | msgspec: zero-copy struct packing; pickle: flexible but slow |
| Process model | Single-process, 2-process, 3-process | More processes = better isolation, higher IPC cost |

vLLM
Key file: vllm/entrypoints/openai/api_server.py — the build_app() function (line 157) conditionally registers API routers based on supported_tasks. Each task type (generate, chat, transcription, realtime) has its own sub-router module.
Serialization: msgspec structs for EngineCoreRequest and EngineCoreOutput. msgspec is chosen because it supports zero-copy deserialization of typed structs — no schema negotiation, no reflection. The EngineCoreRequest (defined in vllm/v1/engine/__init__.py:80) carries prompt_token_ids, sampling_params, mm_features, arrival_time, priority, and other fields as fixed-layout binary.
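To make the fixed-layout idea concrete, here is a minimal sketch using Python's stdlib struct module rather than msgspec itself. The field set (request_id, arrival_time, priority, token ids) mirrors a subset of what EngineCoreRequest carries, but the exact layout below is illustrative, not vLLM's actual wire format.

```python
import struct

# Fixed-layout binary packing: no field names on the wire, no reflection,
# just bytes at known offsets. This is the idea behind msgspec's typed Structs.
HEADER = struct.Struct("<QdiI")  # request_id, arrival_time, priority, num_token_ids

def encode_request(request_id: int, arrival_time: float, priority: int,
                   token_ids: list[int]) -> bytes:
    header = HEADER.pack(request_id, arrival_time, priority, len(token_ids))
    body = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return header + body

def decode_request(buf: bytes):
    request_id, arrival_time, priority, n = HEADER.unpack_from(buf, 0)
    token_ids = list(struct.unpack_from(f"<{n}I", buf, HEADER.size))
    return request_id, arrival_time, priority, token_ids

msg = encode_request(42, 1700000000.5, 1, [101, 2009, 2003])
assert decode_request(msg) == (42, 1700000000.5, 1, [101, 2009, 2003])
```

Because field offsets are known at build time, decoding is a handful of pointer reads; no key strings are parsed, which is where JSON spends most of its time.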
Streaming implementation: The OutputProcessor (vllm/v1/engine/output_processor.py:45) uses RequestOutputCollector with asyncio.Event for non-blocking put/get. If the producer (engine) gets ahead of the consumer (HTTP stream), outputs are delta-merged to avoid queue bloat. This is important: a fast GPU can produce tokens faster than a slow client can consume them.
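A stripped-down sketch of the delta-merge pattern (names and types here are illustrative; vLLM's RequestOutputCollector merges full RequestOutput objects, not token lists):

```python
import asyncio

# put() never blocks: if the consumer hasn't drained the previous output,
# new token deltas are folded into it, so queue depth stays at 1.
class OutputCollector:
    def __init__(self):
        self.output = None
        self.ready = asyncio.Event()

    def put(self, delta: list[str]) -> None:
        if self.output is None:
            self.output = list(delta)
        else:
            self.output.extend(delta)   # merge instead of queueing a new event
        self.ready.set()

    async def get(self) -> list[str]:
        await self.ready.wait()
        out, self.output = self.output, None
        self.ready.clear()
        return out

async def demo():
    c = OutputCollector()
    c.put(["the"]); c.put([" quick"]); c.put([" brown"])  # producer runs ahead
    return await c.get()   # consumer sees one merged delta

print(asyncio.run(demo()))  # → ['the', ' quick', ' brown']
```

The fast-producer/slow-consumer case collapses three pending deltas into a single SSE event, bounding memory regardless of how far the GPU runs ahead.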
Authentication: Simple bearer-token via AuthenticationMiddleware (api_server.py:273). Set via --api-key or VLLM_API_KEY. No OAuth, JWT, or per-tenant auth.
Rate limiting: None built-in. The server accepts --middleware args for user-supplied middleware classes (line 302-312), so rate limiting is an external concern.
SGLang
Architecture: 3-process model with ZMQ IPC
SGLang separates into three process types:
HTTP server + TokenizerManager: FastAPI + uvicorn. Tokenizes requests and forwards to scheduler.
Scheduler: Owns the GPU worker, runs the scheduling loop, executes model forward passes.
DetokenizerManager: Converts output token IDs back to text, formats responses.
This 3-way split isolates tokenization and detokenization (both CPU-bound) from GPU scheduling (latency-critical). The scheduler process never blocks on string operations.
Key file: sglang/srt/entrypoints/http_server.py — all route registration, lifespan setup, and process orchestration.
Each API endpoint family has its own serving handler that normalizes requests to internal GenerateReqInput structs (sglang/srt/managers/io_struct.py).
Multi-tokenizer mode: When --tokenizer-worker-num > 1, SGLang spawns multiple uvicorn worker processes sharing state via shared memory (read_from_shared_memory() at http_server.py:243). Each worker gets its own ZMQ IPC socket. This scales tokenization throughput linearly for high-QPS workloads.
HTTP/2 support: Optional via Granian server (--enable-http2). HTTP/2 multiplexing reduces connection overhead for many concurrent streams.
Authentication: Three auth levels via sglang/srt/utils/auth.py: NORMAL, ADMIN_OPTIONAL, ADMIN_FORCE. Admin endpoints (weight updates, profiling, cache flush) are gated separately from inference endpoints. API key set via --api-key and --admin-api-key.
Health check depth: SGLang's /health endpoint (http_server.py:506-578) sends a real generation request with max_new_tokens=1 to verify the full pipeline is functional. This is a deep health check — it validates tokenization, scheduling, GPU compute, and detokenization. Most frameworks only check process liveness.
IPC implementation detail: Both frameworks use ZMQ PUSH/PULL for request flow (one-directional, non-blocking) and PUB/SUB for stats broadcasting. SGLang additionally uses DEALER/ROUTER for the disaggregated P/D bootstrap handshake. HWM (high-water mark) settings control backpressure -- when the send HWM is reached, ZMQ drops or blocks depending on socket type.
When vLLM wins: Simpler deployment (fewer processes), lower IPC overhead for single-GPU setups, gRPC for internal service meshes.
When SGLang wins: High-QPS scenarios needing tokenizer parallelism, Ollama compatibility for local dev, deep health checks for Kubernetes readiness probes, admin API separation for multi-tenant platforms.
Quiz: Test Your Understanding
Q1 (Concept): Why do both frameworks use ZMQ rather than gRPC for internal IPC between the API process and the scheduler?
ZMQ is a lightweight message-passing library with no schema negotiation, no HTTP overhead, and sub-microsecond latency for intra-host communication. gRPC adds protobuf serialization, HTTP/2 framing, and connection management — overhead that's unnecessary when both endpoints are on the same machine. For internal IPC, the bottleneck is serialization cost, not network reliability. ZMQ with binary serialization (msgspec or dataclass) minimizes this cost. gRPC is better suited for cross-machine communication where its typed schemas and retry semantics add value.
Q2 (Concept): SGLang uses a 3-process model while vLLM uses 2 processes. What's the specific benefit of separating the detokenizer into its own process?
Detokenization is CPU-bound and involves string manipulation (BPE decoding, Unicode handling, special token filtering). In a 2-process model, detokenization either happens in the frontend (competing with HTTP handling and tokenization for CPU) or in the engine core (competing with scheduling). By isolating it, SGLang ensures that: (1) the scheduler process is never blocked by string operations, keeping GPU scheduling latency deterministic, and (2) the HTTP process can focus on I/O multiplexing. The cost is one extra ZMQ hop per token, which is ~1µs — negligible compared to the ~10ms per-token generation time.
Q3 (Calculation): An inference server receives 500 requests/second. Average prompt length is 2000 tokens. Tokenization throughput of a single CPU core using HuggingFace tokenizer is ~500K tokens/second. How many tokenizer worker processes does SGLang need to avoid tokenization becoming the bottleneck?
Required tokenization throughput = 500 req/s * 2000 tokens = 1,000,000 tokens/s. At ~500K tokens/s per core, 2 workers is the bare minimum to keep up. In practice, you'd want headroom for variance (some prompts are much longer), so 3 workers would be prudent. SGLang's --tokenizer-worker-num=3 would handle this.
Q4 (Concept): vLLM's OutputProcessor uses "delta-merging" when the GPU produces tokens faster than the client consumes them. What would happen without this mechanism?
Without delta-merging, each generated token would be queued as a separate SSE event. If the GPU generates at 100 tokens/s but the client reads at 50 events/s, the queue grows unboundedly — consuming memory and increasing tail latency. Delta-merging collapses multiple pending tokens into a single event (e.g., sending "the quick brown" instead of three separate events for "the", " quick", " brown"). This bounds queue depth and reduces per-event overhead (HTTP framing, JSON serialization) at the cost of slightly burstier delivery — which is invisible to users.
2. Router / Admission Control
First Principles
In data-parallel (DP) serving, multiple engine replicas share the incoming request stream. A naive round-robin dispatcher treats every request equally, but LLM requests are not equal: a 32-token chat completion costs ~1000x fewer FLOPs than a 32,000-token document summarization. Without token-aware routing, one replica can be saturated with a single long-context request while another sits idle, causing head-of-line (HoL) blocking. The router must solve two problems: (1) distribute load proportional to actual cost, and (2) prevent starvation of queued requests when a high-priority request arrives.
Neither vLLM nor SGLang implements tenant isolation, per-user rate limiting, or request-level SLO routing at this layer. Both assume a cooperative, single-tenant deployment.
Design Space
| Decision | Option A | Option B | Implication |
|---|---|---|---|
| Load signal | Queue length | Token count | Token count captures cost heterogeneity |
| Routing granularity | Per-request | Per-wave (batch) | Per-wave enables cooperative DP scheduling |
| State sharing | Pull (engine reports) | Push (coordinator broadcasts) | Push is lower latency but more traffic |
| Preemption at router | None (defer to scheduler) | Router-level priority | Router-level is simpler but coarser |
vLLM
vLLM uses DPCoordinator (vllm/v1/engine/coordinator.py:23), a dedicated process that intermediates between front-end API servers and DP engine ranks.
Wave-based cooperative DP. Engines alternate between running and paused states. A global "request wave" counter tracks transitions from running to paused, synchronized via all-reduce in DPEngineCoreProc._has_global_unfinished_reqs. When all engines drain their queues, they collectively pause. When any engine receives a new request, the coordinator broadcasts START_DP_WAVE to resume all engines. This ensures all DP ranks participate in every batch, which is required for operations that involve cross-rank communication (e.g., tensor-parallel all-reduce within each DP group).
Load balancing is front-end side. The coordinator publishes per-engine stats (waiting queue length, running queue length) via ZMQ PUB to all front-ends (coordinator.py:148, EngineState.request_counts). Front-ends make local LB decisions using these stats. In "External LB mode," no stats are published and an external load balancer handles routing.
No priority preemption at the router. Priority is handled at the scheduler level via SchedulingPolicy.PRIORITY (request_queue.py:13), which uses a heap-based RequestQueue.
SGLang
SGLang uses DataParallelController (sglang/srt/managers/data_parallel_controller.py:116) with pluggable load-balancing methods defined in the LoadBalanceMethod enum (line 70), including:
TOTAL_TOKENS -- Routes to the rank with the fewest total tokens (prompt + generated). Uses DPBudget.dispatch() (line 99) with token count as primary key and request count as tiebreaker.
TOTAL_REQUESTS -- Routes to the rank with the fewest total requests.
FOLLOW_BOOTSTRAP_ROOM -- Used in disaggregated mode; routes to the rank with the most available bootstrap room for KV transfer.
DPBudget (line 87) maintains per-rank counters updated via WatchLoadUpdateReq messages from schedulers. After dispatch, it optimistically increments the target rank's request count by 1 to avoid burst routing to the same rank before the next stats update.
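The dispatch rule can be sketched in a few lines. This is a standalone simplification of DPBudget.dispatch, using plain lists for the per-rank counters:

```python
# TOTAL_TOKENS dispatch: argmin by token count, request count as tiebreaker,
# then optimistically bump the winner's request count until the next stats update.
def dispatch_total_tokens(total_tokens: list[int], total_requests: list[int]) -> int:
    rank = min(range(len(total_tokens)),
               key=lambda r: (total_tokens[r], total_requests[r]))
    total_requests[rank] += 1   # optimistic update to avoid burst routing
    return rank

tokens, reqs = [12000, 8000, 15000, 8000], [5, 3, 6, 4]
assert dispatch_total_tokens(tokens, reqs) == 1   # rank 1 wins the 8000-token tie
assert reqs[1] == 4                               # optimistic increment applied
```

Without the optimistic increment, several requests arriving between two stats updates would all land on the same "least loaded" rank.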
Priority preemption. SGLang supports priority-based preemption at the scheduler level via PrefillAdder.preempt_to_schedule() (schedule_policy.py:855). When a high-priority request arrives and insufficient budget exists, the scheduler iterates running requests sorted by (priority * -priority_sign, -wait_queue_entry_time) and preempts them. The configurable threshold priority_scheduling_preemption_threshold gates this behavior.
Comparison
| Aspect | vLLM | SGLang |
|---|---|---|
| LB methods | External or stats-based (front-end) | Built-in token/request-aware methods |
| Coordination | Wave-based (cooperative DP) | Independent dispatch |
| Stats transport | ZMQ PUB from coordinator | ZMQ from schedulers to controller |
| Priority preemption | Scheduler-level (heap queue) | Scheduler-level (sorted scan + preempt) |
| Disaggregated routing | Not at router level | FOLLOW_BOOTSTRAP_ROOM |
vLLM's wave coordination wins when cross-rank synchronization (TP within DP) is required. SGLang's token-aware dispatch wins for heterogeneous workloads where request costs vary by orders of magnitude.
Quiz: Test Your Understanding
Q1 (Concept). Why does round-robin load balancing fail for LLM serving, even when all replicas have identical hardware?
LLM requests have heterogeneous cost: a 100-token request occupies the GPU for ~1ms while a 100,000-token request occupies it for seconds. Round-robin distributes request *count* evenly but not request *cost*. This causes one replica to be saturated while another is idle, increasing tail latency by the duration of the longest request in the unlucky replica's queue.
Q2 (Calculation). You have DP=4. Current token counts per rank: [12000, 8000, 15000, 8000]. Current request counts: [5, 3, 6, 4]. A new request with 4000 prompt tokens arrives. Under SGLang's TOTAL_TOKENS policy, which rank receives it?
TOTAL_TOKENS picks argmin(total_tokens) with total_requests as tiebreaker. Ranks 1 and 3 both have 8000 tokens. Tiebreaker: rank 1 has 3 requests, rank 3 has 4. Rank 1 wins. After dispatch, DPBudget increments rank 1's request count to 4.
Q3 (Concept). Explain why vLLM's wave-based coordination is necessary when DP is combined with tensor parallelism.
In DP+TP, each DP group runs tensor-parallel all-reduce across its TP ranks during every forward pass. If one DP group has work and another is idle, the idle group's TP ranks will not participate in collective communication, causing a deadlock or requiring expensive barrier synchronization. Wave coordination ensures all DP groups run forward passes in lockstep, so collectives proceed uniformly.
3. Global Scheduler
First Principles
[Figure: static batching (wasted slots shown in gray) vs. continuous batching (no wasted slots) under an interactive workload]
The scheduler is the brain of the serving engine. It decides which requests get GPU time at each iteration and how much. Three fundamental insights drive its design.
Insight 1: Static batching wastes GPU. In static batching, you pad all sequences to max_len and run them as a fixed batch. GPU utilization = sum(actual_tokens) / (B * max_len). For a batch of 32 requests where the shortest finishes at 50 tokens and the longest at 2048, utilization can drop below 10%. Continuous batching (Orca, Yu et al. 2022) removes this waste by operating at *iteration granularity*: after each forward pass, finished requests are evicted and new requests are admitted, so the GPU always processes actual tokens.
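The utilization formula is easy to check numerically; the batch below matches the 50-vs-2048 example above:

```python
# Static batching pads every request to the longest generation length in the batch.
def static_batch_utilization(lengths: list[int]) -> float:
    return sum(lengths) / (len(lengths) * max(lengths))

# 32 requests: one runs to 2048 tokens, the rest finish at 50.
lengths = [50] * 31 + [2048]
util = static_batch_utilization(lengths)
print(f"{util:.1%}")  # → 5.5%
```

Over 94% of the batch slots are padding, which is exactly the waste continuous batching reclaims by evicting finished requests every iteration.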
Insight 2: Prefill and decode have opposite resource profiles. For a transformer layer with hidden dim h, number of heads n_h, head dim d, and intermediate dim d_ff:
Prefill (processing P input tokens): FLOPs per layer ~ 2P * params_per_layer. For Llama-70B with GQA (64 query heads, 8 KV head groups, head_dim=128): attention linear params = Q [8192,8192] + K [8192,1024] + V [8192,1024] + O [8192,8192] = 67M + 8.4M + 8.4M + 67M ~ 151M; FFN params (SwiGLU gate+up+down) = 3 * 8192 * 28672 ~ 704M; total ~ 855M params/layer. FLOPs/layer ~ 2P * 855M. Memory traffic ~ 2 * 855M bytes (load weights once, FP16). Arithmetic intensity ~ P FLOPs/byte. For P=4096, this is massively compute-bound.
Decode (generating 1 token per request in batch B): FLOPs per layer ~ 2B * params_per_layer. Memory traffic is still ~ 2 * params_per_layer bytes (weights dominate). Arithmetic intensity ~ B FLOPs/byte. For B=1, this is catastrophically memory-bandwidth-bound.
Concrete example: Llama-70B with GQA (h=8192, d_ff=28672, 64 query heads, 8 KV head groups, head_dim=128, 80 layers, FP16). Attention linear params per layer: Q+O = 2 * 8192^2 = 134M; K+V = 2 * 8192 * 1024 = 16.8M; total attention ~ 151M. FFN params per layer: 3 * 8192 * 28672 = 704M. Total per layer: ~855M params, weight bytes = 2 * 855M = 1.71 GB. Total weight memory: 80 * 1.71 = 137 GB ~ the commonly cited ~140 GB for 70B FP16. H100 SXM: 989 TFLOPS FP16, 3.35 TB/s HBM bandwidth. Balance point: 989e12 / 3.35e12 ~ 295 FLOPs/byte. Decode at B=1 has intensity ~1, so it is 295x underutilizing compute. Prefill at P=4096 has intensity ~4096, so it is ~14x oversubscribing compute relative to bandwidth (compute-bound).
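The parameter and roofline bookkeeping above can be verified in a few lines (same GQA shape: 64 query heads, 8 KV head groups, head_dim=128):

```python
h, d_ff, kv_dim, n_layers = 8192, 28672, 8 * 128, 80

attn_params = 2 * h * h + 2 * h * kv_dim            # Q+O and K+V projections
ffn_params = 3 * h * d_ff                            # SwiGLU gate/up/down
params_per_layer = attn_params + ffn_params
weight_gb = n_layers * params_per_layer * 2 / 1e9    # FP16: 2 bytes per param
balance = 989e12 / 3.35e12                           # H100 FLOPs per HBM byte

assert round(params_per_layer / 1e6) == 856          # the ~855M/layer figure
assert round(weight_gb) == 137                       # ~137 GB of weights
assert round(balance) == 295                         # roofline balance point
```

Any workload whose arithmetic intensity falls below the ~295 FLOPs/byte balance point leaves compute idle; anything above it leaves bandwidth idle.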
Insight 3: Mixed batching creates interference. A long prefill (e.g., 8192 tokens) in the same batch as decode requests adds seconds of latency to the decode tokens (which would otherwise complete in ~40ms). The scheduler must manage this interference.
Academic context. The continuous batching paradigm was introduced by Orca (Yu et al., OSDI 2022). Sarathi (Agrawal et al., 2023) introduced chunked prefill to mitigate prefill-decode interference. DistServe (Zhong et al., OSDI 2024) and Splitwise (Patel et al., ISCA 2024) formalized disaggregated prefill/decode architectures.
Design Space
| Decision | Option A | Option B | Implication |
|---|---|---|---|
| Batch model | Unified (no phases) | Explicit EXTEND/DECODE | Unified is simpler; explicit enables per-phase optimization |
| Chunked prefill | Fixed chunk size | Page-aligned chunks | Page alignment avoids fragmentation |
| Budget tracking | Single token budget | Multi-budget (input, chunk, total) | Multi-budget prevents long prefills from starving decode |
At each iteration, the scheduler: (1) admits new requests (or continues chunked prefills) up to the budget, then (2) assembles the batch and dispatches it to the GPU.
The key invariant is that the total tokens processed in one forward pass never exceeds max_batch_tokens (the GPU's capacity for a single iteration).
Chunked Prefill
Without chunked prefill, a 32,000-token prompt monopolizes the GPU for the entire prefill duration, blocking all decode requests. Chunked prefill splits the prompt into chunks (e.g., 4096 tokens each) and interleaves them with decode iterations. The trade-off is chunk size vs. decode latency:
Large chunks: Better prefill throughput (fewer kernel launches, better GPU utilization per chunk) but higher decode latency spikes.
Small chunks: Lower decode latency but more overhead and potentially lower prefill throughput.
Overlap Scheduling (SGLang)
SGLang's event_loop_overlap (scheduler.py:1381-1433) pipelines CPU scheduling with GPU execution. The sequence per iteration is:
1. Receive requests from tokenizer (CPU).
2. Get next batch -- run the scheduling algorithm (CPU).
3. Launch current batch on GPU (async kernel launch).
4. Process previous batch results -- update KV cache metadata, check stop conditions, send outputs (CPU) -- *while the GPU is running the current batch*.
5. Launch sampling for the current batch if needed (depends on previous batch's grammar state).
The critical overlap: step 4 (CPU-heavy) runs concurrently with step 3's GPU execution. For workloads where CPU scheduling time is non-negligible (e.g., large radix tree operations, grammar checking), this reclaims CPU time that would otherwise be on the critical path. The is_disable_overlap_for_batch method (line 1435) disables overlap for consecutive prefill batches to optimize TTFT.
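The overlap structure can be sketched with a worker thread standing in for the GPU stream. Timings are simulated and function names are ours; only the pipelining shape matters:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_forward(batch_id: int) -> str:
    time.sleep(0.01)                       # stands in for the forward pass
    return f"result-{batch_id}"

def process_result(result: str, log: list) -> None:
    log.append(result)                     # stop checks, detok dispatch, etc.

def overlap_loop(n_batches: int) -> list:
    log, prev = [], None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i in range(n_batches):
            future = pool.submit(gpu_forward, i)   # async "kernel launch"
            if prev is not None:
                process_result(prev, log)          # CPU work overlaps "GPU"
            prev = future.result()                 # sync before next schedule
        process_result(prev, log)                  # drain the last batch
    return log

assert overlap_loop(3) == ["result-0", "result-1", "result-2"]
```

While `gpu_forward(i)` runs in the worker thread, the main thread post-processes batch i-1; results still come out in order because each iteration syncs before scheduling the next.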
Adaptive Token Ratio (SGLang)
SGLang's new_token_ratio (schedule_policy.py:649-654) controls how much memory to reserve for future decode tokens. When admitting a prefill, the scheduler must reserve max_new_tokens * new_token_ratio KV slots for that request's future generated tokens. A ratio of 1.0 is maximally conservative (reserve for full generation length), while lower values allow more concurrent requests but risk OOM if many requests generate long outputs.
The _get_running_request_total_token_offset method (line 447) computes each running request's reservation: min(max_new_tokens - len(output_ids), CLIP_MAX_NEW_TOKENS) * new_token_ratio. This is accumulated into rem_total_token_offset, which is subtracted from available KV pool capacity to compute rem_total_tokens.
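The reservation formula as a standalone sketch (the CLIP_MAX_NEW_TOKENS default below is a placeholder, not SGLang's actual value):

```python
def decode_reservation(max_new_tokens: int, num_generated: int,
                       new_token_ratio: float,
                       clip_max_new_tokens: int = 4096) -> int:
    # min(remaining generation budget, clip), scaled by the ratio
    tokens_left = min(max_new_tokens - num_generated, clip_max_new_tokens)
    return round(tokens_left * new_token_ratio)

# 2000 max_new_tokens, 500 already generated, ratio 0.7 (Quiz Q3's numbers)
assert decode_reservation(2000, 500, 0.7) == 1050
```

Summing this over all running requests gives rem_total_token_offset, the slack subtracted from the KV pool before new prefills are admitted.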
"There's no 'decoding phase' nor 'prefill phase' in the scheduler. Each request just has the num_computed_tokens and num_tokens_with_spec. At each step, the scheduler tries to assign tokens to the requests so that each request's num_computed_tokens can catch up its num_tokens_with_spec."
Scheduling loop. The schedule() method (line 348) processes in order:
Running requests (line 383): iterate self.running, compute num_new_tokens = num_tokens_with_spec + num_output_placeholders - num_computed_tokens, clamp to token_budget and long_prefill_token_threshold for chunking.
Waiting requests (from RequestQueue): pop from a deque (FCFS) or heapq (Priority) via request_queue.py:13-18.
Preemption: If KV allocation fails for a running request, preempt via num_computed_tokens = 0 (full recompute). The preempted request is re-queued.
Key classes:
Scheduler (scheduler.py:67) -- main scheduler.
RequestQueue (request_queue.py:20) -- abstract base; FcfsRequestQueue (deque) and PriorityRequestQueue (heapq).
SchedulingPolicy (request_queue.py:13) -- enum: FCFS or PRIORITY.
SGLang's Approach: Explicit EXTEND/DECODE with Multi-Budget
SGLang uses a mixin-based scheduler (sglang/srt/managers/scheduler.py) with explicit EXTEND (prefill) and DECODE modes.
PrefillAdder (schedule_policy.py:375) is the core budget manager with three simultaneous constraints:
rem_input_tokens (line 383): max new input tokens this iteration (controls prefill admission).
rem_chunk_tokens (line 384): max tokens for a single chunked prefill (controls chunk size).
rem_total_tokens (property, line 457): available KV pool capacity minus reservations for future decode tokens.
The add_one_req method (line ~600) checks all three budgets before admitting a request. If the request fits within rem_chunk_tokens, it is admitted as a full prefill. Otherwise, it is truncated to rem_chunk_tokens // page_size * page_size (page-aligned chunking, line 827).
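The admission/truncation rule as a standalone sketch (helper name is ours):

```python
# Page-aligned chunk admission: a prompt that fits the chunk budget is taken
# whole; otherwise it is truncated down to a page boundary and the remainder
# is deferred to later iterations.
def admit_prefill(prompt_len: int, rem_chunk_tokens: int, page_size: int):
    if prompt_len <= rem_chunk_tokens:
        return prompt_len, 0                       # full prefill this iteration
    trunc = rem_chunk_tokens // page_size * page_size
    return trunc, prompt_len - trunc               # chunked: remainder deferred

# A 6000-token prompt with rem_chunk_tokens=4096 and page_size=16 (Quiz Q2)
assert admit_prefill(6000, 4096, 16) == (4096, 1904)
```

Flooring to a page boundary means a chunk never ends mid-page, so the KV allocator hands out only fully used pages for the chunk.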
Preemption with partial reuse. When preempting, SGLang's radix tree retains the computed KV cache prefix. The preempted request can later resume from the cached prefix rather than recomputing from scratch. This is a significant advantage over vLLM's full-recompute strategy for long-context requests.
Comparison
| Aspect | vLLM | SGLang |
|---|---|---|
| Phase model | Unified (num_computed catching up) | Explicit EXTEND/DECODE |
| Budget tracking | Single token_budget | Triple: input, chunk, total |
| Chunked prefill | long_prefill_token_threshold | Page-aligned rem_chunk_tokens |
| Preemption cost | Full recompute (num_computed_tokens=0) | Partial reuse via radix tree |
| CPU-GPU overlap | Not in scheduler loop | event_loop_overlap |
| Decode reservation | Implicit (KV blocks allocated on demand) | Explicit new_token_ratio reservation |
| Scheduling policy | FCFS or Priority (heapq) | FCFS with priority preemption |
vLLM's unified model is conceptually cleaner and naturally handles speculative decoding (spec tokens are just more tokens to "catch up"). SGLang's explicit phases enable more aggressive optimization: page-aligned chunking, triple-budget control, and overlap scheduling.
Quiz: Test Your Understanding
Q1 (Concept). Explain why continuous batching improves GPU utilization over static batching. What is the utilization formula for static batching with B requests where request i generates L_i tokens and the maximum is L_max?
Static batching pads all requests to L_max. GPU utilization = sum(L_i) / (B * L_max). If L_i varies widely (e.g., 50 to 2048), utilization can be sum(L_i)/(B*2048), which can be <10%. Continuous batching evicts finished requests and admits new ones at each step, so every token processed is an actual token, giving near-100% utilization of the batch dimension.
Q2 (Calculation). Given max_batch_tokens = 8192, 10 running decode requests (each consuming 1 token per step), and a new 6000-token prefill request in the queue. Assume rem_chunk_tokens = 4096 and page_size = 16. Under SGLang's PrefillAdder: (a) How many tokens are available for prefill input? (b) How is the 6000-token request handled?
(a) rem_input_tokens = max_batch_tokens - decode_tokens = 8192 - 10 = 8182. But rem_chunk_tokens = 4096. (b) Since 6000 > rem_chunk_tokens (4096), the request is chunked. trunc_len = 4096 // 16 * 16 = 4096 (already page-aligned). So the first 4096 tokens are prefilled this iteration. The remaining 1904 tokens are deferred to the next iteration(s). After this admission, rem_input_tokens decreases by 4096, rem_chunk_tokens drops to 0, and the request is marked as a new_chunked_req.
Q3 (Calculation). A request has max_new_tokens = 2000 and has already generated 500 tokens. new_token_ratio = 0.7. What is the token reservation for this request's future decode tokens?
tokens_left = min(max_new_tokens - len(output_ids), CLIP_MAX_NEW_TOKENS) * new_token_ratio = min(2000 - 500, CLIP_MAX_NEW_TOKENS) * 0.7. Assuming CLIP_MAX_NEW_TOKENS >= 1500: reservation = 1500 * 0.7 = 1050 tokens. This 1050 is added to rem_total_token_offset, reducing the available pool for new admissions.
Q4 (Concept). Why does vLLM's full-recompute preemption strategy become expensive for long-context requests, and how does SGLang's radix tree mitigate this?
vLLM sets num_computed_tokens = 0 on preemption, requiring the entire prompt to be re-prefilled. For a 32K-token prompt, this costs ~2 * 32768 * 855M * 80 FLOPs (where 855M is the per-layer parameter count for Llama-70B with GQA), which can take seconds. SGLang's radix tree retains the KV cache for the prefix that was already computed. When the request is rescheduled, it resumes from the cached prefix (e.g., if 28K of 32K tokens were cached, only the remaining 4K need re-prefilling). The savings are proportional to the cached prefix fraction.
Q5 (Concept). What CPU operations does SGLang's overlap scheduling hide, and why is overlap disabled for consecutive prefill batches?
Overlap hides: processing previous batch results (updating KV metadata, checking stop conditions, radix tree insertion, sending outputs to clients, grammar state updates). These are CPU-bound operations that run while the GPU executes the current batch. Overlap is disabled for consecutive prefills because the second prefill's TTFT would be delayed by the first prefill's GPU time plus the CPU processing of the first prefill's results. By processing the first prefill's results immediately (without overlap), the second prefill can start sooner, improving TTFT for the first request.
4. Prefill Tier
First Principles
Prefill is the phase where the model processes the full input prompt to populate the KV cache. It is compute-bound because every input token requires a full forward pass through all layers, and the arithmetic intensity scales with the sequence length.
FLOPs derivation for Llama-70B at 4096 input tokens. Per transformer layer, the dominant operations are the linear projections. With h=8192, d_ff=28672, GQA (64 query heads, 8 KV head groups, head_dim=128):
Attention linear projections: Q [8192,8192] + K [8192,1024] + V [8192,1024] + O [8192,8192] = 151M params. FLOPs = 2 * 4096 * 151M = 1.24T FLOPs
FFN (gate + up + down, with SwiGLU): 2 * 4096 * 3 * 8192 * 28672 = 5.8T FLOPs (3 projections for SwiGLU)
Attention (Q-K dot product + softmax-V): 4 * n_query_heads * seq_len^2 * head_dim = 4 * 64 * 4096^2 * 128 = 0.55T FLOPs (2 * seq_len^2 * head_dim per head for QK^T, and the same again for softmax-V). Note: GQA reduces KV *memory* (8 KV head groups are shared across 64 query heads) but does NOT reduce query-side attention compute -- all 64 query heads still compute their own Q-K dot products and softmax-V products.
Per layer total: 1.24T (linear) + 5.8T (FFN) + 0.55T (attention) = ~7.6T FLOPs. Across 80 layers: ~605T FLOPs.
H100 SXM delivers 989 TFLOPS FP16 (sustained ~750 TFLOPS). Time: 605T / 750T ~ 0.8 seconds. This is firmly compute-bound: the weight data loaded is ~137 GB, which at 3.35 TB/s takes ~41ms. The compute takes ~20x longer than the memory transfer.
Prefix sharing economics. In production, many requests share a system prompt (e.g., 2000 tokens). Without prefix sharing, N requests each pay the full 2000-token prefill cost. With prefix caching, the system prompt is computed once and the KV cache is reused, saving (N-1) * 2000 * cost_per_token FLOPs. For N=100, this is a 99% reduction in redundant compute for the shared prefix.
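The savings arithmetic as code (helper name is ours):

```python
# N requests sharing a P-token system prompt pay for one prefill of the
# prefix instead of N.
def shared_prefix_savings(n_requests: int, prefix_tokens: int) -> float:
    without_cache = n_requests * prefix_tokens   # token-prefills without caching
    with_cache = prefix_tokens                   # computed once, reused N-1 times
    return 1 - with_cache / without_cache

# N=100 requests sharing a 2000-token system prompt
assert abs(shared_prefix_savings(100, 2000) - 0.99) < 1e-9
```

The savings approach 1 - 1/N regardless of prefix length, which is why prefix caching pays off most in high-fan-out deployments (agents, RAG, shared system prompts).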
Disaggregated prefill. When the prefill workload is large enough, it can be beneficial to dedicate separate GPU instances to prefill vs. decode. The key trade-off: the KV cache must be transferred from prefill to decode instances. This is economical when transfer_time < scheduling_benefit. The scheduling benefit comes from eliminating prefill/decode interference entirely.
vLLM
vLLM does not have a separate prefill tier by default. All requests go through the unified scheduler. However, it provides a KV connector framework for disaggregated serving:
KVConnectorBase_V1 (vllm/distributed/kv_transfer/kv_connector/v1/base.py) defines the scheduler-side and worker-side primitives.
KVConnectorRole enum: SCHEDULER (metadata binding) and WORKER (actual KV transfer).
The connector provides start_load_kv(), wait_for_layer_load(), save_kv_layer(), and wait_for_save() for layer-pipelined KV transfer.
KVCacheEvent and KVEventBatch (kv_events.py) handle event notification.
SGLang
SGLang provides first-class disaggregated prefill via --disaggregation-mode prefill|decode.
Bootstrap handshake. PrefillBootstrapQueue (sglang/srt/disaggregation/prefill.py:87) manages the handshake between prefill and decode instances. It initializes a CommonKVManager with KV data pointers, page size, and transfer backend configuration. The prefill instance computes the KV cache and then transfers it to the decode instance.
KV events. EventBatch and KVCacheEvent (sglang/srt/disaggregation/kv_events.py:38-56) are msgspec-based structs for efficient serialization. Events are published via ZMQ PUB when blocks are stored, notifying decode instances that KV data is ready.
DecodeReqToTokenPool pre-allocation. The decode instance pre-allocates token pool slots before the KV transfer begins, avoiding allocation contention during transfer.
Transfer backends. SGLang supports multiple transfer backends including NCCL, NIXL (NVIDIA's Inference Xfer Library), and custom RDMA implementations.
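A back-of-envelope estimator for KV size and transfer time, using the Llama-70B shape from this guide (helper names are ours; link rates are nominal peak):

```python
def kv_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors at every layer (FP16 = 2 bytes)
    return 2 * layers * tokens * kv_heads * head_dim * dtype_bytes

def transfer_ms(nbytes: int, gb_per_s: float) -> float:
    return nbytes / (gb_per_s * 1e9) * 1e3

size = kv_bytes(8192)
assert size == 2.5 * 2**30                   # 2.5 GiB for an 8192-token request
assert 2.5 < transfer_ms(size, 900) < 3.5    # ~3 ms over NVLink 4.0
assert 50 < transfer_ms(size, 50) < 55       # ~54 ms over one NDR IB port
```

Comparing these times against the prefill compute time (hundreds of ms to seconds for long prompts) tells you whether disaggregation can pay for itself on a given interconnect.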
Comparison
| Aspect | vLLM | SGLang |
|---|---|---|
| Disaggregation support | KV connector framework (pluggable) | First-class --disaggregation-mode flag |
| KV transfer abstraction | Layer-pipelined (load/save per layer) | Block-level with bootstrap handshake |
| Event notification | KVEventBatch | EventBatch via ZMQ PUB |
| Prefix caching | Block-level hash matching | Radix tree (token-level granularity) |
| Transfer backends | Pluggable connectors | NCCL, NIXL, custom RDMA |
Quiz: Test Your Understanding
Q1 (Calculation). Calculate the KV cache size for a single request with 8192 tokens on Llama-70B (80 layers, 8 KV heads per layer with GQA, head_dim=128, FP16). How long does transfer take on (a) NVLink 4.0 (900 GB/s), (b) InfiniBand NDR (400 Gb/s = 50 GB/s per port), (c) PCIe 5.0 x16 (64 GB/s)?
Per token per layer: 2 (K+V) * 8 heads * 128 * 2 bytes = 4 KB. Per layer for 8192 tokens: 8192 * 4 KB = 32 MB. Total across 80 layers: 80 * 32 MB = 2560 MB = 2.5 GB.
Transfer times:
(a) NVLink 4.0: 2.5 GB / 900 GB/s = 2.8 ms
(b) InfiniBand NDR: 2.5 GB / 50 GB/s = 50 ms
(c) PCIe 5.0 x16: 2.5 GB / 64 GB/s = 39 ms
For disaggregated prefill to be worthwhile, the scheduling benefit (eliminating prefill/decode interference) must exceed the transfer time. At 50ms over InfiniBand NDR, this is only justified for very long prompts or high-QPS scenarios.
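The Q1 arithmetic can be scripted; using exact binary byte counts gives figures a few percent above the hand-rounded decimal-GB numbers:

```python
# KV size for 8192 tokens on Llama-70B (80 layers, 8 KV heads, head_dim=128,
# FP16), then transfer time across the three links from Q1.

TOKENS = 8192
PER_TOKEN_BYTES = 2 * 8 * 128 * 2 * 80   # (K+V) x heads x head_dim x fp16 x layers
kv_bytes = TOKENS * PER_TOKEN_BYTES      # 2.5 GiB

LINK_GBPS = {"NVLink 4.0": 900, "InfiniBand NDR": 50, "PCIe 5.0 x16": 64}
transfer_ms = {name: kv_bytes / (g * 1e9) * 1e3 for name, g in LINK_GBPS.items()}
for name, ms in transfer_ms.items():
    print(f"{name:>14}: {ms:5.1f} ms")
```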
Q2 (Concept). Under what conditions does disaggregated prefill improve overall throughput compared to colocated prefill+decode?
Disaggregated prefill helps when: (1) prefill requests are long enough that their interference with decode causes significant TPOT (time per output token) degradation, (2) the KV transfer time is small relative to the prefill compute time, and (3) the prefill and decode load can be balanced across dedicated instances. It hurts when prefill requests are short (transfer overhead > interference cost) or when load is bursty and one tier sits idle.
Q3 (Concept). Explain why radix-tree prefix caching (SGLang) can share prefixes at finer granularity than block-level hash matching (vLLM).
Block-level hashing computes a hash over each fixed-size block of tokens. Two requests share a prefix only if their token sequences match at block boundaries. If one request has a 2001-token prefix and another has a 2003-token prefix (with block_size=16), the last partial block differs and cannot be shared. A radix tree stores the token sequence as a trie, allowing sharing at arbitrary token boundaries. The 2001-token and 2003-token prefixes share the first 2001 tokens exactly, with no wasted recomputation.
5.
Decode Tier
First Principles
Decode is the autoregressive phase: one new token per request per step. It is memory-bandwidth-bound because each token requires loading the full model weights but performs minimal computation per loaded byte.
Arithmetic intensity derivation for Llama-70B at batch=1. Each parameter performs 2 FLOPs (multiply-add) per token and costs 2 bytes to load (FP16), so arithmetic intensity at batch B = (2*B) / 2 = B FLOPs/byte. At B=1: intensity = 1 FLOP/byte. The H100 balance point is ~295 FLOPs/byte (989 TFLOPS / 3.35 TB/s). Decode at B=1 sits 295x below the compute saturation point.
The batch-size lever. Increasing batch size B scales FLOPs linearly (each request adds one token) without proportionally increasing memory traffic (weights are loaded once per step regardless of B). Intensity at batch B = B FLOPs/byte. Compute saturation: B = 295. Above B=295, decode becomes compute-bound on H100.
In practice, KV cache memory limits B long before compute saturation. For Llama-70B on a single H100 (80GB), after weights (140GB -- requires at least 2 GPUs), KV cache per token per layer is 2 * 8 * 128 * 2 = 4 KB, total per token 80 * 4 KB = 320 KB. With 40GB for KV cache (rough, on a 2-GPU setup), max tokens = 40GB / 320KB ~ 125K tokens. At average context length 2048, max B ~ 60 -- well below the compute saturation point of 295.
KV cache bandwidth overhead. The analysis above considers only weight-loading bandwidth. At longer contexts, KV cache reads add significant pressure. For seq_len=2048, 8 KV heads, head_dim=128, FP16: KV read per token per layer = 2 (K+V) * 2048 * 8 * 128 * 2 bytes = 8 MB. Across 80 layers per step: 640 MB per request. At B=64: 640 MB * 64 = ~40 GB of KV reads per step, on top of ~137 GB of weight reads -- roughly a 30% bandwidth overhead. And because KV reads grow with B while weight reads do not, arithmetic intensity no longer scales linearly with batch size: the effective balance point shifts upward, and decode stays bandwidth-bound out to larger batch sizes than the weight-only estimate of B=295 suggests.
Max tokens/second at B=1. Time per step = weight_load_time = 137 GB / 3.35 TB/s = 40.9 ms. Throughput = 1 / 0.0409 ~ 24 tokens/second. This is the theoretical maximum for single-request decode latency on H100.
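The weight-only roofline above fits in a few lines. This reproduces the section's numbers (137 GB of weights, 3.35 TB/s HBM, 989 TFLOPS) and deliberately ignores KV-cache reads, so it recovers the naive B=295 balance point:

```python
# Back-of-envelope decode roofline for Llama-70B FP16 on H100 (weight-only).

WEIGHT_BYTES = 137e9       # FP16 weights
HBM_BW = 3.35e12           # bytes/s
PEAK_FLOPS = 989e12        # FP16 tensor FLOPS

def step_time_s(batch: int) -> float:
    """Per-step time: max of weight-load time and compute time."""
    flops = 2 * (WEIGHT_BYTES / 2) * batch   # 2 FLOPs per parameter per token
    return max(WEIGHT_BYTES / HBM_BW, flops / PEAK_FLOPS)

def tokens_per_s(batch: int) -> float:
    return batch / step_time_s(batch)

balance_b = PEAK_FLOPS / HBM_BW   # ~295: batch size where intensity hits the roof
print(f"B=1:   {tokens_per_s(1):.0f} tok/s")
print(f"B=256: {tokens_per_s(256):.0f} tok/s")
print(f"compute-bound above B ~ {balance_b:.0f}")
```

Note how total throughput at B=256 is ~256x the B=1 figure: in the bandwidth-bound regime, batching is almost free.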
vLLM
vLLM uses the same unified scheduler for decode. There is no separate decode mode or tier.
Preemption = full recompute. When memory pressure forces preemption, vLLM resets num_computed_tokens = 0. The preempted request must re-prefill its entire prompt. For a 32K-token context, this costs the same as the original prefill (~0.75s on H100 for 70B).
Decode batching. Running requests are processed first in schedule() (line 383). Each running request contributes num_tokens_with_spec - num_computed_tokens tokens (typically 1 for standard decode, more with speculative decoding). The budget consumed by decode is minimal per request, leaving room for new prefill admissions.
SGLang
SGLang has an explicit DECODE forward mode that enables decode-specific optimizations.
Retraction (defer without freeing KV). When the scheduler needs to reduce batch size but doesn't want to lose computed KV cache, it can "retract" a request -- temporarily remove it from the active batch while keeping its KV cache in the radix tree. When re-admitted, the request resumes without any recomputation.
Preemption with partial reuse. When preemption is necessary (e.g., for a higher-priority request), the radix tree retains as much of the KV prefix as possible. The preempted request's KV cache is not immediately freed; it remains in the tree as evictable entries. If the request is rescheduled before eviction, it resumes from the cached state.
Decode-specific batching. In decode mode, SGLang can apply optimizations like CUDA graph capture for fixed batch sizes, avoiding kernel launch overhead for the small per-token computation.
Comparison
| Aspect | vLLM | SGLang |
| --- | --- | --- |
| Decode mode | Unified (no separate mode) | Explicit DECODE forward mode |
| Preemption cost | Full recompute | Partial reuse via radix tree |
| KV retention on preemption | Freed immediately | Retained as evictable in radix tree |
| Retraction (defer) | Not available | Supported |
| CUDA graph decode | Supported | Supported |
| Batch size control | Token budget based | Token budget + max_running_requests |
Quiz: Test Your Understanding
Q1 (Calculation). For Llama-70B FP16 on H100 SXM (3.35 TB/s HBM, 989 TFLOPS): (a) What is the maximum decode throughput at B=1 in tokens/second? (b) At what batch size does decode become compute-bound? (c) What is the maximum decode throughput at B=256?
(a) Time per step at B=1 = 137 GB / 3.35 TB/s = 40.9 ms. Throughput = 1/0.0409 = 24 tok/s per request (also 24 tok/s total since B=1).
(b) Balance point: each parameter does 2 FLOPs and costs 2 bytes (FP16), so intensity = B FLOPs/byte. Compute saturation at B = 295 (= 989T/3.35T). Above B=295, decode is compute-bound.
(c) At B=256 (still bandwidth-bound since 256 < 295): time per step is still ~40.9 ms (dominated by weight loading). Total throughput = 256 / 0.0409 = 6,260 tok/s. Per-request throughput = 6260/256 = 24 tok/s (same as B=1 -- this is the beauty of batching in the bandwidth-bound regime: total throughput scales linearly with B while per-request latency stays constant).
Q2 (Calculation). You have 40 GB of KV cache memory for Llama-70B (80 layers, 8 KV heads, head_dim=128, FP16). What is the maximum batch size at average context length 2048? At context length 512?
At context 2048: tokens per request = 2048. Memory per request = 2048 * 320 KB = 640 MB. Max B = 40 GB / 640 MB = 62.
At context 512: tokens per request = 512. Memory per request = 512 * 320 KB = 160 MB. Max B = 40 GB / 160 MB = 250.
Note that B=250 is still below the compute saturation point (295), so even at short context lengths, decode remains bandwidth-bound on H100 given typical KV cache constraints.
Q3 (Concept). Why is SGLang's retraction mechanism more efficient than vLLM's preemption for transient load spikes?
A transient load spike may require temporarily reducing batch size. vLLM's preemption resets num_computed_tokens to 0, requiring full re-prefill when the request is rescheduled -- potentially seconds of wasted compute. SGLang's retraction removes the request from the active batch but keeps its KV cache in the radix tree. When load subsides (often within a few iterations), the request resumes with zero recomputation. The only cost is the memory occupied by the retained KV cache during the retraction period. For spikes lasting < 1 second, retraction avoids the multi-second re-prefill penalty entirely.
6.
KV-Cache System
First Principles
[Interactive figure: paged-attention block table. Logical blocks R1:0-R1:2 and R2:0-R2:1 map through the block table to scattered physical blocks (3, 7, 1, 5, 2) in GPU HBM.]
The KV cache stores key and value projections from every attention layer so that previously computed tokens are not recomputed during autoregressive decode. The fundamental allocation problem is that you do not know a request's output length at arrival time. If you reserve max_seq_len tokens of contiguous memory for every request, the internal fragmentation is:
waste_per_request = (max_seq_len - actual_len) * KV_bytes_per_token
If max_seq_len = 8192 and a typical request uses 1024 tokens, that is 7168 * 320 KiB ≈ 2.2 GiB wasted per request. With 64 concurrent requests, waste exceeds 140 GiB -- more than any single GPU's HBM. Naive contiguous allocation is a non-starter at scale.
Paged Attention borrows the core idea from OS virtual memory: break the KV cache into fixed-size blocks (pages) and map logical token positions to physical blocks through a block table. Each block holds block_size tokens (typically 16). Per-block size: block_size * KV_bytes_per_token = 16 * 320 KiB = 5 MiB.
Allocation granularity drops from max_seq_len to block_size. Fragmentation is at most one partially-filled block per request -- under 5 MiB instead of 2+ GiB.
Capacity planning. For 80 GiB HBM, suppose model weights occupy 35 GiB (a 70B model at FP16 with TP=2 puts ~35 GiB of weights per GPU). Remaining KV budget: 45 GiB. Total blocks: 45 * 1024 / 5 = 9216 blocks = 147,456 tokens of KV capacity.
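The capacity-planning arithmetic above can be scripted directly from the model constants:

```python
# Paged-KV capacity planning for Llama-70B GQA (8 KV heads, head_dim=128,
# 80 layers, FP16, block_size=16), reproducing the section's numbers.

KV_HEADS, HEAD_DIM, LAYERS, BYTES = 8, 128, 80, 2
BLOCK_SIZE = 16

per_token_kib = 2 * KV_HEADS * HEAD_DIM * BYTES * LAYERS / 1024   # K and V
per_block_mib = per_token_kib * BLOCK_SIZE / 1024

hbm_gib, weights_gib = 80, 35
kv_budget_mib = (hbm_gib - weights_gib) * 1024
num_blocks = int(kv_budget_mib // per_block_mib)

print(f"{per_token_kib:.0f} KiB/token, {per_block_mib:.0f} MiB/block")
print(f"{num_blocks} blocks = {num_blocks * BLOCK_SIZE} tokens of KV capacity")
```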
Design Space
| Decision | Option A | Option B | Implication |
| --- | --- | --- | --- |
| Prefix dedup structure | Hash table (vLLM) | Radix tree (SGLang) | O(1) lookup vs O(k) traversal; tree supports mid-node splitting |
| Hashing granularity | Block-aligned only | Arbitrary prefix | Hash table cannot cache partial blocks; tree can split at any token |
| Eviction policy | Single LRU | Pluggable (LRU/LFU/SLRU/...) | More policies enable workload-specific tuning |
| Hash function | SHA256 / xxHash | Position-aware chained hash | Chained hash encodes position; content-only hash is position-agnostic |
| Block size | Larger (16-32) | Smaller (1-4) | Larger reduces table overhead but increases fragmentation |
vLLM
Core data structures. The KVCacheBlock dataclass (vllm/v1/core/kv_cache_utils.py:110-155) holds block_id, ref_cnt, and an optional _block_hash (a BlockHashWithGroupId). The doubly-linked pointers prev_free_block / next_free_block are embedded directly on the block to avoid separate node allocations.
FreeKVCacheBlockQueue (kv_cache_utils.py:158-366) is a doubly-linked list with sentinel head/tail nodes. All operations -- popleft, remove, append -- are O(1) with zero Python object allocation (pointer swizzling on existing attributes). The ordering implements LRU: when a block is freed, it is appended to the tail. Eviction pops from the head (least recently used). Blocks with ref_cnt > 0 are never in the free list, preventing eviction of shared prefix blocks.
Hash-based prefix dedup. BlockHash is a NewType over bytes (kv_cache_utils.py:37). The hash function is configurable: SHA256 via CBOR serialization (hashing.py:43-58) or xxHash-128 (hashing.py:76-79). The hash input is the token IDs within the block plus optional extra keys (multimodal hashes, LoRA names, cache salts -- need_extra_keys at line 369). A BlockHashToBlockMap maps BlockHashWithGroupId -> KVCacheBlock, enabling O(1) lookup of whether a given block of tokens already exists in the cache.
Limitation: block-boundary alignment. If a cached prefix is 30 tokens and block_size=16, only the first block (tokens 0-15) is fully cached and hashable. Tokens 16-29 form an incomplete second block with no hash entry. A new request sharing 25 tokens of that prefix can only reuse the first block -- the remaining 9 shared tokens are recomputed. This is the fundamental cost of block-granularity hashing.
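A toy model makes the boundary effect easy to see. This is a simplified stand-in for vLLM's block-hash map (plain tuples instead of real hashes), showing that only full blocks are shareable:

```python
# Toy block-aligned prefix cache: only *complete* blocks get a hashable key,
# so a partial trailing block can never be shared. Not vLLM's real code.

BLOCK_SIZE = 16

def full_blocks(tokens: list[int]) -> list[tuple[int, ...]]:
    """Hashable keys: one per complete block of tokens."""
    n = len(tokens) // BLOCK_SIZE
    return [tuple(tokens[i*BLOCK_SIZE:(i+1)*BLOCK_SIZE]) for i in range(n)]

def cached_prefix_len(cache: set, tokens: list[int]) -> int:
    """Longest block-aligned prefix already present in the cache."""
    hit = 0
    for key in full_blocks(tokens):
        if key not in cache:
            break
        hit += BLOCK_SIZE
    return hit

cache: set = set()
first = list(range(30))            # 30-token prompt: 1 full block + 14 leftover
cache.update(full_blocks(first))   # only tokens 0-15 become cacheable

second = list(range(25))           # shares 25 tokens with `first`
print(cached_prefix_len(cache, second))   # only the first block hits;
                                          # tokens 16-24 are recomputed
```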
SGLang
RadixCache (sglang/python/sglang/srt/mem_cache/radix_cache.py) uses a radix tree (trie) where each edge is labeled with a sequence of token IDs and each node stores KV cache indices.
TreeNode (line 121-147) contains:
children: dict -- regular dict with explicit child creation in _split_node (using defaultdict would create ghost nodes on key lookup)
parent: TreeNode -- back-pointer for lock-ref propagation
key: RadixKey -- the token ID sequence on the edge from parent to this node
value: Optional[torch.Tensor] -- KV cache slot indices for this node's tokens
lock_ref: int -- reference count; when > 0, node is protected from eviction
hash_value: Optional[List[str]] -- page-level hashes for disaggregated KV transfer
hit_count, last_access_time, creation_time, priority -- metadata for eviction
Important implementation detail: TreeNode.key is typed RadixKey, not raw token IDs. RadixKey supports an extra_key field for multimodal/LoRA discrimination -- two sequences with identical tokens but different LoRA adapters or multimodal inputs will NOT share cache. The child lookup uses a configurable get_child_key_fn (see radix_cache.py:184).
Prefix matching (_match_prefix_helper, line 667-691): walks down the tree matching token IDs. If a match ends mid-node (the incoming key shares only a prefix of an edge label), _split_node (line 693-713) breaks the edge into two: one node for the shared prefix, one for the remainder. This means SGLang can cache and share prefixes at arbitrary token boundaries, not just block boundaries.
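A minimal radix tree in the spirit of RadixCache (heavily simplified: token-ID keys only, no KV tensors, lock refs, or eviction) shows the mid-node split and arbitrary-boundary sharing:

```python
# Minimal radix-tree prefix cache with mid-node splitting. Illustrative
# sketch only -- SGLang's RadixCache adds KV indices, lock_refs, eviction.

class Node:
    def __init__(self, key):
        self.key = list(key)     # token IDs on the edge from parent
        self.children = {}       # first token of child edge -> Node

class RadixTree:
    def __init__(self):
        self.root = Node([])

    def _split(self, parent, child, k):
        """Break child's edge at offset k, inserting an intermediate node."""
        mid = Node(child.key[:k])
        child.key = child.key[k:]
        mid.children[child.key[0]] = child
        parent.children[mid.key[0]] = mid
        return mid

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            k = 0
            while (k < len(child.key) and i + k < len(tokens)
                   and child.key[k] == tokens[i + k]):
                k += 1
            if k < len(child.key):            # match ends mid-edge: split
                child = self._split(node, child, k)
            node, i = child, i + k

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                break
            k = 0
            while (k < len(child.key) and i + k < len(tokens)
                   and child.key[k] == tokens[i + k]):
                k += 1
            i += k
            if k < len(child.key):
                break
            node = child
        return i

tree = RadixTree()
tree.insert(list(range(2001)))               # 2001-token prefix
print(tree.match_prefix(list(range(2003))))  # shares all 2001 tokens
```

This is exactly the Q3 scenario from the block-hash side: with block_size=16, a hash table would share only 2000-floor-to-block tokens, while the tree shares all 2001.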
Fragmentation warning: _split_node (radix_cache.py:702) clones tensor values via .clone(), allocating new GPU memory for KV indices on every mid-node split. Under workloads with high prefix diversity (thousands of unique system prompts sharing partial prefixes), repeated splits create many small tensor allocations that fragment the KV pool. vLLM's block-aligned approach avoids this cost entirely by never splitting below block granularity.
Eviction strategies (evict_policy.py): seven pluggable strategies that return a priority tuple for heap-based eviction:
| Strategy | Priority Key | Use Case |
| --- | --- | --- |
| LRU (line 16) | last_access_time | General-purpose, default |
| LFU (line 22) | (hit_count, last_access_time) | Protect frequently reused system prompts |
| FIFO (line 27) | creation_time | Predictable, simple |
| MRU (line 32) | -last_access_time | Scan-resistant workloads |
| FILO (line 37) | -creation_time | Retain oldest (shared) prefixes |
| Priority (line 41) | (priority, last_access_time) | Application-set priority levels |
| SLRU (line 49) | (is_protected, last_access_time) | Two-segment: probationary + protected |
Eviction (evict(), at radix_cache.py:582) uses a maintained evictable_leaves set (not full tree traversal) with a min-heap for priority ordering. Only leaf nodes with lock_ref == 0 are candidates. When a leaf is evicted, its parent may become a new leaf candidate if all children are gone. Lock-ref propagation (inc_lock_ref, line 611-624) walks from the matched node up to the root, incrementing lock_ref at each ancestor. This ensures no ancestor of an in-use node can be evicted.
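The priority-tuple-plus-heap pattern is small enough to sketch. This is a simplified illustration (flat leaves, no tree or lock_ref), not SGLang's evict_policy.py:

```python
# Heap-based eviction keyed on pluggable priority tuples (LRU/LFU/FIFO shown).
# Toy sketch of the pattern, not SGLang's actual implementation.

import heapq
from dataclasses import dataclass

@dataclass
class Leaf:
    name: str
    last_access_time: float
    hit_count: int
    creation_time: float

# Each policy maps a leaf to a sortable key; heapq evicts the minimum first.
POLICIES = {
    "lru":  lambda n: (n.last_access_time,),
    "lfu":  lambda n: (n.hit_count, n.last_access_time),
    "fifo": lambda n: (n.creation_time,),
}

def evict_order(leaves, policy):
    heap = [(POLICIES[policy](n), n.name) for n in leaves]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

leaves = [
    Leaf("system_prompt", last_access_time=9.0, hit_count=50, creation_time=0.0),
    Leaf("one_off_chat",  last_access_time=8.0, hit_count=1,  creation_time=5.0),
    Leaf("recent_burst",  last_access_time=1.0, hit_count=3,  creation_time=2.0),
]

print(evict_order(leaves, "lru"))   # oldest access evicted first
print(evict_order(leaves, "lfu"))   # fewest hits evicted first
```

Note how LFU protects the hot system prompt (50 hits) even though FIFO would evict it first -- the workload-specific tuning the table above alludes to.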
Comparison
| Dimension | vLLM | SGLang |
| --- | --- | --- |
| Data structure | Flat hash table | Radix tree (trie) |
| Lookup complexity | O(1) amortized | O(k/page_size) where k = prefix length |
| Prefix sharing granularity | Block-aligned only | Arbitrary (mid-node split) |
| In-batch sharing | Via ref_cnt on shared blocks | Implicit -- multiple requests traverse same tree path |
| Per-entry overhead | — | ~300 bytes (TreeNode with children dict, metadata) |
| Hash function | SHA256-CBOR or xxHash-128 | Position-aware chained hashing |
When vLLM wins: High-throughput serving with uniform request patterns where block-aligned caching suffices and the O(1) lookup matters at scale. The lower per-block memory overhead also helps when block counts exceed millions.
When SGLang wins: Workloads with diverse, overlapping prefixes (multi-turn chat, few-shot prompting with varied examples) where arbitrary-boundary prefix sharing yields materially higher cache hit rates. The pluggable eviction strategies help when workload access patterns are non-LRU.
Quiz: Test Your Understanding
Q1 (Calculation). A Llama-70B model uses GQA with 8 KV heads, head_dim=128, 80 layers, FP16. The serving GPU has 80 GiB HBM. Model weights under TP=2 consume 35 GiB per GPU. Block size = 16. (a) What is the per-token KV size? (b) Per-block KV size? (c) How many blocks fit in the remaining HBM? (d) Maximum tokens of KV capacity?
(a) Per token: 2 (K+V) * 8 heads * 128 * 2 bytes * 80 layers = 320 KiB.
(b) Per block: 16 * 320 KiB = 5 MiB.
(c) Remaining HBM: 80 - 35 = 45 GiB. Blocks: 45 * 1024 / 5 = 9216.
(d) 9216 * 16 = 147,456 tokens.
Q2 (Concept). Why does content-only hashing (hashing just the token IDs in a block) break correctness with ALiBi or other position-dependent encodings, while it works for RoPE applied at attention time?
With RoPE, positional encoding is applied to Q and K at attention computation time using the absolute position, so the cached K/V values are position-independent -- the same token IDs at different absolute positions produce identical K/V. Content-only hashing is safe because matching content implies matching K/V.
With ALiBi, the bias is a function of the relative distance between query and key positions. The K/V values themselves are position-independent, BUT if a system encodes absolute position into K/V (as some ALiBi variants or learned positional embeddings do), then the same token IDs at different positions produce different K/V. Content-only hashing would match blocks that have different K/V values, corrupting attention output. You would need position-aware hashing (chaining with parent hash or including absolute position in the hash input).
Q3 (Design). SGLang's radix tree splits nodes when a prefix match ends mid-edge. Explain why a hash table cannot support this operation, and what the concrete cache-miss cost is for a workload where 80% of requests share a 100-token prefix but block_size=16.
A hash table keys on complete block contents. If two requests diverge at token 100, the first 6 complete blocks (tokens 0-95) are cacheable. Tokens 96-99 form a partial 7th block that cannot be hashed or shared. The hash table has no mechanism to represent "the first 4 tokens of this block are shared." Each new request must recompute tokens 96-99, wasting 4 tokens of KV compute per request.
In a radix tree, the node covering tokens 96-111 would be split at position 100: a new node for tokens 96-99 (shared) and a child for tokens 100-111 (divergent). The shared 4-token prefix is reused. Over 1000 requests, the hash table wastes 1000 * 4 * 320 KiB = 1.25 GiB of redundant KV cache writes (and the compute that produced them); the radix tree wastes zero.
Q4 (Concept). In vLLM's FreeKVCacheBlockQueue, why are sentinel (fake) head and tail nodes used instead of null checks?
Sentinel nodes eliminate branch instructions in every linked-list operation. Without sentinels, popleft, append, and remove each need to check whether the node is the head or tail and handle those cases separately (4-6 branches per operation). With sentinels, every real node always has valid prev_free_block and next_free_block pointers, so the pointer-swizzling code is branchless. On the critical scheduling path executed thousands of times per second, eliminating these branches measurably reduces Python overhead. This is a classic systems optimization borrowed from Linux kernel linked-list implementations.
7.
Distributed Execution Layer
First Principles
[Interactive figure: tensor-parallel column split -- columns of the weight matrix are sharded across ranks, with an all-reduce after the output projection.]
A single GPU cannot hold the weights and KV cache of models beyond ~13B parameters at FP16 (26 GiB weights alone). Even for models that fit, the compute of a single GPU may be insufficient to meet latency targets. Distributed execution partitions computation across GPUs. The three axes of parallelism -- Tensor (TP), Pipeline (PP), and Expert (EP) -- address different bottlenecks.
Design trade-offs:
- DP-Attention reduces KV memory but adds an all-reduce per attention layer.
- Communication backend: NCCL all-reduce vs custom dispatchers (DeepEP, NixlEP) -- custom kernels overlap compute and communication but increase complexity.
Tensor Parallelism (TP)
Column-parallel linear. Weight matrix W of shape [d_in, d_out] is split along the output dimension. Rank r receives W[:, r*d_out/T : (r+1)*d_out/T]. The forward pass computes Y_r = X @ W_r locally with no communication. The output Y_r has shape [B, d_out/T]. For layers that need the full output (e.g., before a row-parallel layer), an all-gather collects [Y_0, ..., Y_{T-1}].
Row-parallel linear. W is split along the input dimension. Rank r receives W[r*d_in/T : (r+1)*d_in/T, :]. Input X must be split (or all-gathered) so each rank sees its slice. Each rank computes a partial sum Y_r = X_r @ W_r. An all-reduce sums the partial results: Y = sum(Y_r).
MHA head partitioning. For H attention heads with TP degree T, rank r gets heads [rH/T, (r+1)H/T). Each rank computes attention independently on its head subset. QKV projections use column-parallel; output projection uses row-parallel. The all-reduce after the output projection is the sole communication per attention layer.
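The column-/row-parallel algebra above can be checked numerically without any GPUs. A pure-Python sketch with a toy 4x4 weight and TP=2 ("all-reduce" is just an elementwise add of the partial products):

```python
# Tiny numerical check of tensor-parallel sharding: column-parallel shards
# concatenate to the full output; row-parallel partial sums add to it.

def matmul(X, W):
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

X = [[1.0, 2.0, 3.0, 4.0]]                                       # [B=1, d_in=4]
W = [[(i + 1) * (j + 1.0) for j in range(4)] for i in range(4)]  # [4, 4]
T = 2                                                            # TP degree

# Column-parallel: rank r holds W[:, r*2:(r+1)*2]; outputs concatenate.
col_shards = [[row[r*2:(r+1)*2] for row in W] for r in range(T)]
col_out = [sum((matmul(X, Wr)[0] for Wr in col_shards), [])]

# Row-parallel: rank r holds W[r*2:(r+1)*2, :] plus X's matching slice;
# partial products sum (the all-reduce) to the full result.
row_shards = [W[r*2:(r+1)*2] for r in range(T)]
x_shards = [[X[0][r*2:(r+1)*2]] for r in range(T)]
partials = [matmul(x_shards[r], row_shards[r]) for r in range(T)]
row_out = [[a + b for a, b in zip(partials[0][0], partials[1][0])]]

assert col_out == matmul(X, W)
assert row_out == matmul(X, W)
print("column-parallel concat and row-parallel all-reduce both match")
```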
Communication volume. Each all-reduce on a tensor of size M bytes using ring all-reduce transfers 2 * (T-1)/T * M bytes total across the ring. For hidden dimension d=8192, batch tokens B=4096, FP16:
M = B * d * 2 bytes = 4096 * 8192 * 2 = 64 MiB
Ring all-reduce volume = 2 * (T-1)/T * 64 MiB
For T=8: 2 * 7/8 * 64 = 112 MiB per all-reduce
With 80 layers and 2 all-reduces per layer (attention + FFN): 160 * 112 MiB = 17,920 MiB = 17.5 GiB of all-reduce traffic per forward pass
At NVLink bandwidth of 900 GB/s bidirectional (H100), the per-layer all-reduce time for 112 MiB is 112 MiB / 900 GB/s = ~0.12 ms. With 2 all-reduces per layer (attention + FFN) across 80 layers, the total all-reduce time per forward pass is 0.12 ms x 80 x 2 = ~19 ms. This is acceptable for prefill (where compute takes hundreds of ms) but significant for decode where the total compute time may be only 2-5 ms per layer (~40 ms total across 80 layers at batch=256).
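The traffic/time arithmetic above in script form (exact binary MiB arithmetic lands slightly above the hand-rounded ~19 ms figure):

```python
# Ring all-reduce traffic and time for hidden=8192, 4096 batch tokens, FP16,
# T=8, 900 GB/s NVLink, 80 layers with 2 all-reduces per layer.

def ring_allreduce_bytes(msg_bytes: float, t: int) -> float:
    # Each of the 2*(T-1) ring steps moves 1/T of the message per rank.
    return 2 * (t - 1) / t * msg_bytes

MIB = 2**20
msg = 4096 * 8192 * 2                    # 64 MiB activation tensor
per_ar = ring_allreduce_bytes(msg, 8)    # 112 MiB on the wire
per_ar_ms = per_ar / 900e9 * 1e3         # ~0.13 ms at 900 GB/s

total_ms = per_ar_ms * 80 * 2            # 2 all-reduces/layer x 80 layers
print(f"{per_ar / MIB:.0f} MiB per all-reduce, ~{total_ms:.0f} ms per forward pass")
```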
Pipeline Parallelism (PP)
Layers are partitioned into P stages across P GPUs. Stage s holds layers [sL/P, (s+1)L/P). The forward pass sends activations from stage s to stage s+1.
Micro-batching. Without micro-batching, pipeline bubble fraction = (P-1)/P (only one stage active at a time). With M micro-batches, bubble fraction drops to:
bubble = (P-1) / (P-1 + M)
For P=4, M=8: bubble = 3/11 = 27%. For P=4, M=32: bubble = 3/35 = 8.6%. In practice, PP is used primarily to scale beyond a single node's NVLink domain, accepting the latency cost for memory capacity.
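The bubble formula above, scripted for a quick sweep over micro-batch counts:

```python
# Pipeline bubble fraction: bubble = (P-1) / (P-1 + M).

def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (p - 1 + m)

for m in (1, 8, 32):
    print(f"P=4, M={m:>2}: bubble = {bubble_fraction(4, m):.1%}")
```

At M=1 the formula reduces to (P-1)/P, the no-micro-batching case.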
Expert Parallelism (EP) for MoE
MoE models (DeepSeek-V3, Mixtral) replace dense FFN layers with a router that selects top-k experts per token. With E experts and EP degree T, each GPU holds E/T experts.
All-to-all dispatch. Each GPU routes its tokens to the GPU hosting the selected expert. This is a permutation communication pattern. The key challenge is load imbalance: if expert popularity is skewed, some GPUs receive disproportionately many tokens.
SGLang provides a rich set of dispatchers under sglang/srt/layers/moe/token_dispatcher/:
StandardDispatcher (standard.py:86): baseline all-to-all using NCCL
DeepEPDispatcher (deepep.py:731): uses DeepEP library for overlapped dispatch with normal and low-latency modes
MoriEPDispatcher (moriep.py:900): optimized EP dispatcher with RDMA-aware routing
NixlEPDispatcher (nixl.py:350): NIXL-based EP for cross-node dispatch
MooncakeEPDispatcher (mooncake.py:286): Mooncake transfer engine integration
FlashinferDispatcher (flashinfer.py:72): FlashInfer-based EP for single-node
Expert capacity factor and load balancing. In standard MoE implementations, each expert has a fixed capacity: the maximum number of tokens it can process per batch is capacity_factor * (total_tokens / num_experts). A typical capacity factor is 1.25, allowing 25% headroom above perfectly uniform routing. When an expert receives more tokens than its capacity, the excess tokens are *dropped* -- routed to a no-op that outputs zeros. Token dropping loses information and degrades output quality.
To mitigate this, an auxiliary load-balancing loss is added during training, penalizing routing distributions that deviate from uniform. Without this loss, a small number of experts become disproportionately "popular" while most remain underutilized, creating both quality degradation (from dropping) and compute waste (idle experts).
DeepSeek-V3 takes a different approach: it eliminates token dropping entirely, using dynamic allocation where experts process all routed tokens regardless of capacity. This avoids information loss but introduces variable-size all-to-all communication patterns, since the message sizes between GPU pairs depend on the runtime routing decisions rather than being fixed at capacity. The variable communication complicates NCCL collective scheduling and motivates custom dispatchers like DeepEP.
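A toy routing histogram makes the capacity/dropping mechanics concrete (illustrative only; no framework's actual code):

```python
# Capacity-factor token dropping: each expert accepts at most
# capacity_factor * (tokens / num_experts) tokens; overflow is dropped.

def dropped_tokens(routed: list[int], capacity_factor: float) -> int:
    total, e = sum(routed), len(routed)
    capacity = int(capacity_factor * total / e)
    return sum(max(0, n - capacity) for n in routed)

# 8 experts, 800 tokens: uniform routing -> nothing dropped.
print(dropped_tokens([100] * 8, 1.25))

# Skewed routing (same 800 tokens): two "popular" experts exceed capacity=125.
skewed = [300, 200, 100, 80, 60, 30, 20, 10]
print(dropped_tokens(skewed, 1.25))   # (300-125) + (200-125) tokens dropped
```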
NCCL Tuning and Communication Optimization
NCCL algorithm selection. NCCL supports three collective algorithms, each optimal for different message sizes and topologies:
Ring: data circulates around a ring of GPUs. Bandwidth-optimal (achieves full bisection bandwidth) but latency grows linearly with GPU count. Best for small-to-medium messages on a single node.
Tree: binary tree reduction/broadcast. Latency grows logarithmically with GPU count, but bandwidth utilization is lower. Best for large messages on multi-node clusters where ring latency is prohibitive.
CollNet (SHARP): Scalable Hierarchical Aggregation and Reduction Protocol. Offloads reduction operations to InfiniBand switches themselves, so data is partially reduced in-network before reaching destination GPUs. This halves cross-node traffic for all-reduce (each direction carries half the data since reduction happens in the switch).
These are controlled via NCCL_ALGO=Ring|Tree|CollNet and NCCL_PROTO=Simple|LL|LL128 (protocol: Simple for large messages, LL/LL128 for low-latency small messages). In practice, NCCL auto-selects based on message size and topology, but manual tuning can yield 10-30% improvement for specific workload patterns.
GPU Direct RDMA (GDR). Standard GPU-to-GPU communication across nodes requires: GPU -> CPU memory (PCIe) -> NIC -> network -> NIC -> CPU memory (PCIe) -> GPU. GDR eliminates the CPU staging: the NIC reads from and writes to GPU memory directly via PCIe peer-to-peer access. This requires the NIC and GPU to be on the same PCIe switch (or root complex with ACS disabled). GDR reduces cross-node latency by 2-5 us and is critical for disaggregated prefill/decode architectures where KV cache transfers between prefill and decode GPUs must be low-latency. Enable with NCCL_NET_GDR_LEVEL=SYS (or finer: PIX for same PCIe switch, PXB for same PCIe bus, PHB for same NUMA node).
DP-Attention (SGLang Innovation)
Standard data parallelism replicates the full model and splits the batch. Each DP rank stores KV cache for its batch slice. For long-context workloads, the KV cache is the memory bottleneck, not model weights.
Insight. Instead of each DP rank storing KV for all attention heads (of its batch slice), split the heads across DP ranks. Rank r stores KV for heads [rH/D, (r+1)H/D) but for ALL requests in the batch. This reduces per-GPU KV memory by a factor of D (the DP-Attention degree).
Implementation (sglang/srt/layers/dp_attention.py). compute_dp_attention_world_info (line 237-251) computes the attention TP rank and DP rank from the global TP rank.
The rank layout is (dp, cp, tp) where tp is the fastest-changing dimension. DpPaddingMode controls how variable-length sequences are padded: MAX_LEN pads all sequences to the longest in the batch (simpler but wasteful), SUM_LEN pads to the sum (better utilization).
Trade-off. DP-Attention reduces KV memory linearly with D but adds an all-reduce after every attention layer (to recombine head outputs). For decode-heavy workloads with long contexts (>32K tokens), the KV memory savings dominate. For short-context, high-throughput workloads, the added communication is a net loss.
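The memory side of the trade-off is simple arithmetic. A sketch using the 320 KiB-per-token figure from earlier sections (the batch/context numbers here are illustrative, not from the text):

```python
# DP-Attention memory trade-off: per-GPU KV footprint drops by the DP degree
# D, at the cost of one extra all-reduce per attention layer.

KV_PER_TOKEN = 320 * 1024   # bytes across all 80 layers (Llama-70B GQA, FP16)

def kv_per_gpu_gib(batch: int, seq_len: int, d: int) -> float:
    """KV footprint per GPU with heads split across D DP-Attention ranks."""
    return batch * seq_len * KV_PER_TOKEN / d / 2**30

# Example long-context load: 32 requests at 64K tokens each.
print(f"D=1: {kv_per_gpu_gib(32, 65536, 1):.0f} GiB")   # exceeds any single GPU
print(f"D=8: {kv_per_gpu_gib(32, 65536, 8):.0f} GiB")   # fits an 80 GiB H100 budget
```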
Scalability considerations. TP scales within a node (typically TP<=8 due to the NVLink domain). Beyond one node, PP and DP add scale: TP=8, PP=4, DP=4 serves 128 GPUs. The communication-to-computation ratio grows with scale: at TP=8, ~19 ms of all-reduce per step vs ~40 ms of compute (for decode at batch=256). Communication is ~32% of step time. At 1000+ GPUs, fault tolerance becomes critical -- neither framework supports mid-generation recovery from GPU failures. A TP rank failure is fatal to the entire TP group.
| Aspect | vLLM | SGLang |
| --- | --- | --- |
| DP-Attention | — | First-class support with configurable padding modes |
| Communication overlap | Via CUDA graph and async ops | DeepEP/MoriEP overlap compute and communication |
Quiz: Test Your Understanding
Q1 (Symbolic). For a weight matrix W of shape [8192, 8192] with TP=4 using column-parallel partitioning, what slice does rank 2 receive?
Column-parallel splits along the output dimension (dim 1). Each rank gets 8192/4 = 2048 columns. Rank 2 receives W[:, 4096:6144], i.e., all 8192 rows but columns 4096 through 6143.
Q2 (Symbolic). A model has 32 attention heads. With TP=4, which heads does each rank compute?
32 heads / 4 ranks = 8 heads per rank.
Rank 0: heads [0, 8)
Rank 1: heads [8, 16)
Rank 2: heads [16, 24)
Rank 3: heads [24, 32)
Q3 (Calculation). Calculate the all-reduce communication volume for one transformer layer with TP=8, hidden_dim=8192, batch_tokens=4096, FP16. How many such all-reduces occur per layer?
Ring all-reduce volume = 2 * (T-1)/T * M = 2 * 7/8 * 64 = 112 MiB.
Per transformer layer there are typically 2 all-reduces: one after the attention output projection (row-parallel) and one after the FFN second linear (row-parallel). Total per layer = 2 * 112 = 224 MiB.
Q4 (Concept). Why does DP-Attention become increasingly beneficial as context length grows, but harmful for short-context workloads?
KV cache memory scales as O(batch_size * seq_len * num_heads * head_dim). For long contexts, KV memory dominates HBM -- splitting heads across DP ranks directly reduces the per-GPU KV footprint, allowing more concurrent requests or longer sequences.
The cost is one all-reduce per attention layer per forward pass. For short contexts, the attention computation is fast (small seq_len), so the all-reduce latency becomes a significant fraction of total layer time. The KV memory savings are also small (short sequences use little KV), so the trade-off is net negative. The crossover point depends on hardware bandwidth, but typically DP-Attention wins when seq_len exceeds ~8K-16K tokens.
8.
Model Runtime / Executor
First Principles
The runtime layer bridges the model graph and the GPU hardware. Its two critical responsibilities are minimizing kernel launch overhead and selecting the fastest attention backend.
CUDA Graphs from First Principles
Every CUDA kernel launch traverses the CUDA driver stack: argument validation, kernel selection, stream synchronization check, and PCIe/NVLink doorbell. This costs approximately 3-10 microseconds per launch. A single decode step for a 70B model executes 100-300 kernels (attention, layernorm, linear, activation, rotary embedding -- per layer times 80 layers, plus sampling). Total launch overhead:
300 kernels * 5 us/launch = 1,500 us = 1.5 ms
A decode step's total GPU compute time is 2-5 ms for moderate batch sizes. Launch overhead is 30-75% of total time -- the GPU is idle waiting for the CPU to issue the next kernel.
Solution: CUDA Graphs. Record a sequence of kernel launches into a graph object. Replay the entire graph with a single launch call (15-50 us overhead depending on graph complexity -- node count, dependency chains -- regardless of kernel count). Savings:
Before: 300 * 5 = 1,500 us
After: 1 * 30 = 30 us (typical for moderate-complexity graphs)
Savings: 1,470 us per decode step = 1.47 ms
At 500 tokens/second decode rate, that is 0.735 seconds saved per second -- a 2.7-3x throughput improvement for decode-bound workloads.
The catch: CUDA graphs require fixed tensor addresses, shapes, and execution order. Variable batch sizes (the norm in serving) require either (a) capturing multiple graphs for different batch sizes, (b) padding to the nearest captured size, or (c) breaking out of the graph for dynamic sections.
vLLM uses CUDAGraphMode with eager (no graph), static (fully captured), and hybrid modes. It also leverages torch.compile with Inductor fusion passes (vllm/compilation/passes/) that fuse small kernels (e.g., RMS normalization + quantization, attention + quantization) to reduce total kernel count before graph capture.
SGLang implements breakable CUDA graphs: the graph can be exited mid-execution when a dynamic-shape operation is encountered, run that operation eagerly, then re-enter the graph. Piecewise capture subdivides the model into segments that are individually graphed. This achieves most of the graph benefit without requiring fully static shapes.
Attention Backends
The attention computation softmax(QK^T / sqrt(d)) V has O(N^2) memory for the attention matrix if materialized. FlashAttention avoids this by tiling:
FlashAttention-2: Tiles Q and KV into blocks that fit in SRAM. Computes partial softmax per tile, maintains a running max for numerical stability. Memory: O(N) instead of O(N^2). Throughput: near peak FLOPS on A100/H100 for long sequences.
FlashInfer: Optimized for the variable-length batched decode pattern common in serving. Its PageAttention kernel directly indexes into paged KV cache blocks, avoiding the scatter/gather overhead of converting paged layout to contiguous before calling FlashAttention.
vLLM: The attention selector uses a registry-based dispatch pattern (vllm/v1/attention/backends/registry.py), not direct compute-capability branching. Platform detection via cuda.py feeds into the registry, but the selector itself is configuration-driven. The configuration is captured in AttentionSelectorConfig (selector.py:21-47), which considers head size, dtype, kv_cache_dtype, MLA usage, sparse attention, and sink tokens. The registry maps these configurations to backends: on Hopper (H100), FlashInfer is preferred for its paged-decode kernels; on Ampere, FlashAttention-2 is the fallback.
SGLang supports 15+ attention backends including FlashInfer (default on CUDA), FlashAttention, Triton kernels, and domain-specific backends like NSA (Native Sparse Attention) for models with learned sparsity patterns. The backend is selected at server startup based on hardware and model architecture.
FlashAttention vs FlashInfer paged KV handling. FlashAttention requires gathering paged KV blocks into a contiguous buffer before attention (or using block-diagonal masking to simulate contiguity). FlashInfer's BatchDecodeWithPagedKVCacheWrapper natively iterates the page table within the kernel, issuing loads directly from scattered physical pages. This avoids an O(batch * seq_len * head_dim) gather copy, which is significant at high batch sizes -- for batch=256 with 4K context on 128-dim heads, the gather alone is 256 * 4096 * 128 * 2 bytes = 256 MiB of unnecessary HBM traffic per attention layer.
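The gather-traffic figure can be reproduced with a one-line model (an illustrative helper, not an engine API):

```python
def paged_kv_gather_bytes(batch, seq_len, head_dim, dtype_bytes=2):
    """Extra HBM traffic per attention layer to gather scattered KV pages
    into a contiguous buffer before a non-paged attention kernel can run."""
    return batch * seq_len * head_dim * dtype_bytes

extra = paged_kv_gather_bytes(256, 4096, 128)   # matches the 256 MiB figure above
```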
Hopper TMA (Tensor Memory Accelerator). Hopper introduces hardware-accelerated asynchronous bulk copies from global memory to shared memory without occupying CUDA cores. The copy and compute stages are hardware-decoupled: the TMA unit operates independently of the warp scheduler. FlashAttention-3 leverages TMA to achieve higher occupancy: while Tensor Cores compute attention on one tile, TMA prefetches the next KV tile into shared memory. This producer-consumer pipeline is why FA-3 on Hopper outperforms FA-2 significantly -- the Tensor Cores never stall waiting for data, and the memory system is continuously utilized. On Ampere (no TMA), FA-2 must use software-managed shared memory loads, where warps alternate between issuing loads and computing, leaving either the memory system or Tensor Cores idle in each phase.
Key Kernel Fusions for Inference Throughput
The most impactful kernel fusions for LLM serving eliminate HBM round-trips between operations that are naturally sequential:
Fused RMSNorm + residual add + quantize: The baseline executes three kernels: (a) RMSNorm reads hidden states from HBM and writes normalized output, (b) residual add reads normalized output and residual, writes sum, (c) quantize reads FP16 result, writes INT8/FP8. The fused kernel reads hidden states and residual once, computes norm + add + quantize in registers, and writes the quantized output once -- eliminating two full HBM round-trips. In vLLM, this is handled by Inductor fusion passes (rms_quant_fusion.py). In SGLang, the fusion is implemented in the model forward pass.
Fused SwiGLU gate * up: The MLP's gating mechanism computes SiLU(x @ W_gate) * (x @ W_up). Without fusion, both gate and up projections are materialized to HBM as separate tensors, then an elementwise kernel reads both and writes the product. The fused kernel performs the elementwise multiply in registers immediately after the matmuls, avoiding materializing two full [batch, intermediate_dim] tensors.
Fused RoPE in attention: FlashInfer fuses rotary positional embedding directly into the attention kernel -- Q and K are rotated in registers as they are loaded, before the dot product. FlashAttention-2 requires a separate RoPE kernel that reads Q/K from HBM, applies rotation, and writes back, before the attention kernel reads them again. For long sequences, this extra read-write pass is non-trivial.
Fused top-k/top-p sampling: SGLang's sgl-kernel fuses the sort + cumulative sum + threshold comparison into a single pass over the logit distribution. vLLM uses separate PyTorch ops (sort, cumsum, masking, multinomial), each launching a separate kernel with its own HBM traffic.
Sampling Pipeline
After the final layer produces logits of shape [batch, vocab_size], the sampling pipeline converts them to token IDs:
vLLM (vllm/v1/sample/sampler.py:21-60):
Optional raw logprobs computation (for API response)
Sample via TopKTopPSampler: temperature scaling, min_p, top_k, top_p, multinomial sample
Gather top-k logprobs for response
SGLang (sglang/srt/layers/sampler.py:41-100):
Custom logit processors + NaN detection
Greedy fast-path: if all requests are greedy, torch.argmax directly
Temperature scaling
Top-k renormalization via sgl_kernel.top_k_renorm_prob (AOT CUDA kernel)
Top-p renormalization via sgl_kernel.top_p_renorm_prob
Sampling via FlashInfer's top_k_top_p_sampling_from_probs or min_p_sampling_from_probs
SGLang's use of sgl-kernel for renormalization and FlashInfer for sampling avoids Python-side probability manipulation, keeping the hot path in fused CUDA kernels.
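An unfused NumPy reference of the top-k/top-p pipeline makes the kernel count concrete -- each commented step is a separate kernel launch in the naive approach that fused samplers collapse into a single pass. This is a sketch, not either engine's actual code:

```python
import numpy as np

def top_k_top_p_sample(logits, k, p, temperature=1.0, seed=0):
    """Reference (unfused) top-k/top-p sampling over a single logit vector."""
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # sort           (kernel 1)
    probs_sorted = probs[order]
    probs_sorted[k:] = 0.0                   # top-k mask     (kernel 2)
    csum = np.cumsum(probs_sorted)           # cumulative sum (kernel 3)
    cutoff = np.searchsorted(csum, p * probs_sorted.sum()) + 1
    probs_sorted[cutoff:] = 0.0              # top-p mask     (kernel 4)
    probs_sorted /= probs_sorted.sum()       # renormalize
    rng = np.random.default_rng(seed)
    return order[rng.choice(len(probs_sorted), p=probs_sorted)]  # multinomial (kernel 5)
```

Each intermediate (sorted probs, cumulative sums, masks) is a full pass over the vocabulary-sized tensor; a fused kernel keeps all of it in registers/shared memory.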
Comparison
Dimension
vLLM
SGLang
CUDA graphs
Static capture, torch.compile fusion
Breakable/piecewise graphs
Attention backends
Platform-aware auto-selection
15+ backends, manual or auto
Sampling kernels
PyTorch + custom TopKTopP
sgl-kernel AOT + FlashInfer
Kernel fusion
Inductor IR passes (compilation/passes/)
JIT kernel library (jit_kernel/)
Greedy fast-path
Part of unified sampler
Separate torch.argmax branch
Quiz: Test Your Understanding
Q1 (Calculation). A decode step has 250 kernel launches at 5 us each. With CUDA graphs (30 us typical replay overhead for a moderate-complexity graph), what is the absolute time saved? If total decode step time was 4 ms, what is the relative speedup?
Without graphs: 250 * 5 = 1,250 us launch overhead. With graphs: 30 us. Savings: 1,220 us.
Total time without: 4,000 us. Total time with: 4,000 - 1,250 + 30 = 2,780 us.
Speedup: 4,000 / 2,780 = 1.44x (44% faster).
Q2 (Concept). Why can CUDA graphs not handle variable batch sizes, and what are the two workarounds?
CUDA graphs record the exact kernel arguments (tensor pointers, shapes, strides) at capture time. When batch size changes, tensor shapes change, which means different memory addresses, different grid dimensions, and potentially different kernel selections. Replaying the graph with wrong shapes produces incorrect results or crashes.
Workaround 1: Capture multiple graphs for discrete batch sizes (e.g., powers of 2: 1, 2, 4, 8, ..., 256). Pad the actual batch to the nearest captured size. Cost: memory for multiple graphs + compute waste from padding.
Workaround 2: Breakable/piecewise graphs (SGLang's approach). Capture graph segments that have fixed shapes (e.g., per-layer compute with fixed max-batch padding) and exit the graph for dynamic operations (e.g., variable-length attention). Re-enter the graph afterward.
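Workaround 1's padding policy can be sketched in a few lines (hypothetical helper; the capture sizes shown are the usual powers of two):

```python
def nearest_captured_batch(batch, captured=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Round a runtime batch size up to the nearest pre-captured graph size."""
    for size in captured:
        if size >= batch:
            return size
    raise ValueError(f"batch {batch} exceeds largest captured graph")

# A batch of 17 pads to 32: ~47% of the graphed compute is wasted on padding.
padding_waste = 1 - 17 / nearest_captured_batch(17)
```

The trade-off is visible directly: denser capture sets reduce padding waste but cost more capture time and GPU memory for the stored graphs.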
Q3 (Concept). FlashAttention achieves O(N) memory instead of O(N^2) for attention. Where does the N^2 come from in naive attention, and how does tiling eliminate it?
Naive attention materializes the full [N, N] attention score matrix S = Q @ K^T in HBM before applying softmax and multiplying by V. This matrix has N^2 elements, each consuming 2-4 bytes.
FlashAttention tiles Q into blocks of size B_q and KV into blocks of size B_kv. For each Q-block, it iterates over KV-blocks, computing the partial [B_q, B_kv] attention scores in SRAM, accumulating the output and maintaining a running softmax denominator. At no point is the full [N, N] matrix materialized. The only HBM storage is the output [N, d] and the per-row log-sum-exp for the backward pass -- both O(N). The block sizes B_q, B_kv are chosen to fit in SRAM (typically 64-128).
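A minimal NumPy sketch of the streaming-softmax idea (single head, no causal mask, forward pass only): it produces the same output as naive attention while never materializing the [N, N] score matrix:

```python
import numpy as np

def naive_attention(q, k, v):
    """Materializes the full [N, N] score matrix -- O(N^2) memory."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=64):
    """FlashAttention-style streaming softmax: only an [N, block] tile of scores
    exists at a time; a running max m and denominator l give O(N) extra memory."""
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)   # running row max (numerical stability)
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)            # rescale previous partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

The real kernels additionally tile Q, keep the tiles in SRAM, and apply causal masking, but the rescale-and-accumulate recurrence is the same.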
9.
Quantization + Kernel Backend
First Principles
LLM inference is memory-bandwidth-bound during decode: the GPU must read the entire weight matrix for each token generated, but each token only performs one matmul per weight element (arithmetic intensity ~1). Reducing weight precision from FP16 (2 bytes) to INT4 (0.5 bytes) cuts memory bandwidth by 4x, directly translating to ~4x decode throughput for bandwidth-bound workloads.
Why quantization works. Trained weight distributions are approximately Gaussian with most values clustered near zero. The information content per weight is far less than 16 bits. Empirically, 4-bit quantization with group-wise scaling preserves >99% of model quality for most tasks.
INT4 group quantization (group_size=128). Group 128 consecutive weights. Compute scale = max(abs(group)) / 7 (for symmetric 4-bit: range [-7, 7]). Quantize: w_int4 = round(w_fp16 / scale). Dequantize: w_fp16 = w_int4 * scale. Scale factor overhead: 1 FP16 value (2 bytes) per 128 weights = 0.015625 additional bytes per weight.
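The procedure above can be sketched in NumPy (illustrative; real formats add packing of two INT4 values per byte and sometimes zero-points):

```python
import numpy as np

def quant_int4_groups(w, group_size=128):
    """Symmetric 4-bit group quantization: one FP16 scale per group of weights."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # symmetric range [-7, 7]
    q = np.clip(np.round(g / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_int4_groups(q, scale):
    """Reconstruct FP32 weights: per-group scale broadcasts over the group."""
    return (q * scale.astype(np.float32)).ravel()
```

The maximum reconstruction error per group is half a quantization step (scale/2), which is why group-wise scales beat a single per-tensor scale: outliers only degrade their own group.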
Model size calculation. 70B parameters at FP16 = 70B * 2 = 140 GiB. At INT4 with group_size=128:
Weight bytes = 70B * 0.5 bytes = 35 GiB
Scale bytes = (70B / 128) * 2 bytes = 1.09 GiB
Total = 36.09 GiB (fits in a single 80 GiB GPU)
AWQ (Activation-aware Weight Quantization): Observes that a small fraction of weight channels are "salient" -- they correspond to large activation magnitudes and disproportionately affect output quality. AWQ protects these channels by scaling them up before quantization (equivalently, absorbing the scale into preceding layers), then using standard group quantization. This preserves the critical channels at the cost of slightly worse quantization for non-salient channels.
GPTQ: Uses second-order (Hessian) information to quantize weights one column at a time, redistributing quantization error to not-yet-quantized columns. This is optimal in the L2 sense for each column given the Hessian of the layer's reconstruction loss. More expensive to calibrate than AWQ but often yields better perplexity at the same bit width.
Marlin kernels: Hand-optimized GPU kernels that fuse INT4 dequantization with matrix multiplication. Instead of dequantizing weights to FP16 in a separate kernel and then calling cuBLAS, Marlin reads INT4 weights, dequantizes in registers, and immediately accumulates in FP16/FP32. This eliminates the intermediate FP16 weight tensor from HBM, achieving near-peak memory bandwidth utilization.
Tensor Core data type requirements and hardware generations. Not all quantization formats map directly to Tensor Core operations:
INT4 weight-only quantization (AWQ, GPTQ): Weights are stored as INT4 in HBM, but computation happens in FP16 or FP32 on Tensor Cores after software dequantization. The Tensor Core itself does not perform INT4 matrix multiply -- the INT4 format saves memory bandwidth (reading 0.5 bytes instead of 2 bytes per weight), but the arithmetic throughput is identical to FP16. The speedup comes entirely from being memory-bandwidth-bound: 4x less data to read = up to 4x faster.
FP8 (E4M3 for forward, E5M2 for gradients): Requires Hopper's 4th-generation Tensor Cores, which natively support FP8 matrix multiply-accumulate. On Hopper, FP8 achieves 2x the FLOPS of FP16 (3,958 TFLOPS vs 1,979 TFLOPS on H100 SXM). FP8 Tensor Cores are *not available on Ampere* -- on A100, FP8 weights must be dequantized to FP16 before hitting Tensor Cores, providing bandwidth savings but no compute advantage.
FP4 (E2M1): Native Tensor Core support exists only on Blackwell (B100/B200), where FP4 achieves 2x the FLOPS of FP8. On Hopper, FP4 requires software dequantization to FP8 before reaching Tensor Cores. This means FP4 on Hopper saves memory bandwidth (half the bytes of FP8) but does *not* provide 2x compute throughput over FP8 -- the Tensor Cores still operate at FP8 speed. The practical benefit on Hopper is fitting larger models in HBM and reducing memory-bandwidth bottlenecks, not increasing arithmetic throughput.
vLLM
vLLM supports 12+ quantization formats under vllm/model_executor/layers/quantization/: AWQ (awq.py), AWQ-Marlin (awq_marlin.py), GPTQ (gptq.py), GPTQ-Marlin (gptq_marlin.py), FP8 (fp8.py), MXFP4 (mxfp4.py), BitsAndBytes (bitsandbytes.py), ModelOpt (modelopt.py), TorchAO (torchao.py), and more. The base_config.py defines the QuantizationConfig interface that all formats implement.
Kernel fusion is handled through Inductor compilation passes (vllm/compilation/passes/). These operate on the torch.compile IR to identify and fuse patterns like RMSNorm+Quantize, Attention+Quantize, and MLA+Attention+Quantize into single kernels, reducing launch overhead and intermediate memory traffic.
SGLang
SGLang takes a two-tier approach:
AOT kernels (sgl-kernel): Performance-critical kernels are implemented as ahead-of-time compiled CUDA/C++ extensions in the sgl-kernel package. These include top_k_renorm_prob, top_p_renorm_prob, and quantization primitives, compiled once at install time for maximum performance.
JIT kernels (sglang/jit_kernel/): A library of lightweight Triton and CUDA kernels compiled at first use. Quantization-related kernels include:
per_token_group_quant_8bit.py: per-token group INT8 quantization
The JIT approach enables hardware-specific tuning (kernel parameters are auto-tuned for the current GPU) without requiring a heavy build infrastructure.
Comparison
Dimension
vLLM
SGLang
Supported formats
12+ (broadest coverage)
Focused set + sgl-kernel custom ops
Kernel strategy
Marlin + Inductor fusion passes
sgl-kernel AOT + Triton JIT
FP8 support
Full (E4M3, E5M2, per-tensor, per-channel)
Full + per-token group quantization
INT4 support
AWQ, GPTQ, AWQ-Marlin, GPTQ-Marlin
AWQ (JIT dequantize), GPTQ-Marlin
MXFP4/NV-FP4
Yes (mxfp4.py)
Yes (nvfp4.py JIT kernel)
Kernel auto-tuning
Via torch.compile Inductor
Triton auto-tuner in JIT kernels
When vLLM wins: When you need to deploy a model in an unusual quantization format (BitsAndBytes, GGUF, TorchAO, etc.). vLLM's breadth of format support is unmatched.
When SGLang wins: When peak kernel performance matters more than format coverage. The sgl-kernel AOT path avoids JIT compilation latency and Python dispatch overhead for the most critical operations.
Quiz: Test Your Understanding
Q1 (Calculation). A 70B-parameter model uses INT4 quantization with group_size=128 and FP16 scales. (a) What is the total model size? (b) What is the effective bits-per-weight including scale overhead? (c) How much HBM is saved compared to FP16?
(a) Weights: 70B * 0.5 bytes = 35 GiB; scales: (70B / 128) * 2 bytes = 1.09 GiB; total = 36.09 GiB.
(b) 4 bits + 16 bits / 128 weights = 4.125 effective bits per weight.
(c) 140 - 36.09 = ~104 GiB saved, a ~3.9x reduction.
Q2 (Concept). Why is per-channel quantization (one scale per output channel) better than per-tensor quantization (one scale for the entire weight matrix)?
Per-tensor quantization uses a single scale = max(abs(W)) / max_int for the entire matrix. If even one outlier weight is large, the scale is large, and all other weights (which are near zero) are quantized with poor resolution -- they all map to 0 or +/-1 in INT4.
Per-channel quantization computes a separate scale for each output channel (row of the weight matrix). Channels with small weight magnitudes get a small scale (high resolution), and channels with outliers get a large scale (lower resolution but only affecting that channel). The quantization error is distributed more evenly. Empirically, per-channel INT8 matches FP16 quality for most models, while per-tensor INT8 shows measurable degradation.
The cost is storing one scale per channel instead of one per tensor -- negligible for typical hidden dimensions (8192 scales * 2 bytes = 16 KiB vs 2 bytes).
Q3 (Concept). Why do Marlin kernels achieve higher throughput than the naive "dequantize to FP16 then cuBLAS GEMM" approach?
The naive approach has two passes over the weight data:
Dequantize kernel: reads INT4 weights from HBM, writes FP16 weights to HBM (doubles memory traffic)
cuBLAS GEMM: reads FP16 weights from HBM for the matmul
Total HBM traffic for weights: 0.5 bytes (INT4 read) + 2 bytes (FP16 write) + 2 bytes (FP16 read) = 4.5 bytes per weight. This is worse than just reading FP16 (2 bytes)!
Marlin fuses dequantization into the GEMM: reads INT4 weights once from HBM (0.5 bytes), dequantizes in registers (free -- register bandwidth is ~100x HBM bandwidth), and immediately accumulates the matmul. Total HBM reads: 0.5 bytes per weight. This achieves the full 4x bandwidth reduction that quantization promises.
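The byte accounting can be written down directly (a toy model, not the Marlin source):

```python
def hbm_bytes_per_weight(fused):
    """HBM traffic per INT4 weight: fused dequant-GEMM (Marlin-style) vs a
    separate dequantize kernel followed by an FP16 GEMM."""
    int4, fp16 = 0.5, 2.0
    if fused:
        return int4                  # single INT4 read; dequantized in registers
    return int4 + fp16 + fp16        # INT4 read + FP16 write + FP16 read back
```

The unfused path moves 9x more weight bytes than the fused one (4.5 vs 0.5), and even 2.25x more than skipping quantization entirely.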
10.
Speculation Layer
First Principles
Autoregressive decoding is fundamentally sequential: generating token $i$ requires the hidden states produced while generating token $i-1$. For a sequence of $N$ tokens, this means $N$ serial forward passes through the entire model. Each pass is memory-bandwidth-bound during decode (batch size per sequence = 1 token), meaning the arithmetic units sit mostly idle while weights stream from HBM. The GPU processes one token but pays the full cost of loading every parameter.
The core insight behind speculative decoding is that verification is cheaper than generation. If a cheap draft model proposes $K$ candidate tokens, the expensive target model can verify all $K$ in a single forward pass because the attention over those $K$ candidates is fully parallelizable (each candidate at position $i$ attends to the prefix plus candidates $0..i-1$). When the draft model has acceptance rate $\alpha$ per token, we get $K$ tokens verified for approximately 2 forward-pass equivalents (one draft, one verify), yielding a speedup approaching $K/2$ in the best case.
Expected accepted tokens (linear chain). With $K$ draft tokens each accepted independently with probability $\alpha$, the number of accepted tokens before the first rejection follows a truncated geometric distribution. The expected count is:
$E[\text{accepted}] = \sum_{i=1}^{K} \alpha^i = \frac{\alpha\,(1 - \alpha^K)}{1 - \alpha}$
Note this is *not* $\alpha/(1-\alpha)$ (the untruncated geometric mean) -- the truncation to $K$ terms matters. For $\alpha=0.8, K=5$: $E = 0.8 \times (1 - 0.8^5)/(1 - 0.8) = 0.8 \times 3.36 \approx 2.69$ tokens per verify step, plus one additional token the target emits every step (the correction sample at the first rejected position, or a bonus sample when all drafts are accepted).
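As a numerical check on the truncated-geometric expectation $\sum_{i=1}^{K} \alpha^i$, a closed form against a Monte Carlo simulation (function names are illustrative):

```python
import random

def expected_accepted(alpha, k):
    """Closed form for the truncated geometric: E = sum_{i=1..K} alpha^i."""
    return sum(alpha ** i for i in range(1, k + 1))

def simulate_accepted(alpha, k, trials=200_000, seed=0):
    """Monte Carlo: count draft tokens accepted before the first rejection."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for i in range(k):
            if rng.random() >= alpha:   # rejection truncates the chain
                total += i
                break
        else:
            total += k                  # all K drafts accepted
    return total / trials
```

For alpha=0.8, K=5 both give about 2.69 accepted drafts per verify step.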
Design Space
Decision
Option A
Option B
Trade-off
Draft source
Separate small model
Target model's own heads (EAGLE/MTP)
Separate model = more flexible, EAGLE = no extra weights to load
Token structure
Linear chain
Tree (branching)
Tree explores more paths but needs custom attention masks
Rejection method
Strict (argmax match)
Probabilistic (distribution-preserving)
Strict = simpler, probabilistic = exact target distribution
Draft-verify overlap
Sequential
Pipelined (V2)
Pipelining hides draft latency but adds implementation complexity
Speculation depth ($K$)
Small (3-5)
Large (8-16)
Larger $K$ = more compute per step, but diminishing returns if $\alpha$ is low
The Draft-Verify Paradigm
Strict rejection. The simplest method: the target model samples independently at each draft position. If its sample matches the draft token, accept; otherwise reject and discard all subsequent draft tokens. This does NOT preserve the target distribution exactly when using stochastic sampling (it biases toward tokens the draft model also predicts), but it is correct for greedy decoding.
Probabilistic rejection sampling (distribution-preserving). For each draft position with draft token $d$:
Compute acceptance probability: $a = \min\!\bigl(1,\; p(d) / q(d)\bigr)$, where $p$ is the target probability and $q$ is the draft probability.
Draw $u \sim \text{Uniform}(0,1)$. If $u < a$, accept $d$.
If rejected, sample a correction token from the residual distribution: $r(x) \propto \max(0,\; p(x) - q(x))$.
This guarantees the output distribution is *identical* to sampling from the target model alone -- there is zero quality degradation. The proof follows from the fact that the marginal probability of emitting any token $x$ is $q(x) \cdot \min(1, p(x)/q(x)) + (1 - \sum_y q(y)\min(1,p(y)/q(y))) \cdot r(x) = p(x)$.
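The identity in the proof can be checked numerically: the helper below (illustrative, not vLLM code) computes the analytic marginal of the accept/residual procedure and confirms it equals $p$ even for a deliberately poor draft distribution:

```python
import numpy as np

def speculative_marginal(p, q):
    """Analytic marginal P(emit x) under draft-q / target-p rejection sampling."""
    accept = np.minimum(1.0, p / q)            # acceptance prob given draft proposed x
    residual = np.maximum(0.0, p - q)
    residual /= residual.sum()                 # correction distribution r(x)
    p_reject = 1.0 - (q * accept).sum()        # prob the draft token is rejected
    return q * accept + p_reject * residual

p = np.array([0.5, 0.3, 0.2])                  # target distribution
q = np.array([0.25, 0.25, 0.5])                # deliberately poor draft
marginal = speculative_marginal(p, q)          # equals p exactly
```

A bad draft only lowers the acceptance rate (and thus the speedup); it never changes the output distribution.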
vLLM implementation. RejectionSampler in vllm/v1/worker/gpu/spec_decode/rejection_sampler.py supports three modes selected by rejection_sample_method:
strict: Triton kernel _strict_rejection_sample_kernel (line 24). One Triton program per request. Iterates draft positions, stores the target's sample, and sets rejected=True on first mismatch. Clean and branchless inside the GPU warp.
probabilistic: Delegates to probabilistic_rejection_sample() which processes target logits and draft logits jointly, computing the acceptance ratio per token with temperature-adjusted distributions.
synthetic: For testing/benchmarking. Uses compute_synthetic_rejection_sampler_params() to derive a base acceptance rate and decay factor, then simulates acceptance without real draft logits.
Implementation note: vLLM has TWO RejectionSampler classes: (1) vllm/v1/sample/rejection_sampler.py:30 (nn.Module, used during normal sampling with spec decode) and (2) vllm/v1/worker/gpu/spec_decode/rejection_sampler.py:100 (worker-level, contains the Triton kernels). The worker-level sampler contains the actual Triton acceptance kernels; the sample-level one wraps it with logprob handling.
Tree-Structured Speculation
A linear chain wastes all tokens after the first rejection. Tree speculation hedges by branching: at each depth level, multiple candidate tokens are proposed. If the top-1 at depth 2 is rejected, the top-2 might still match.
Tree attention masks. Each node in the tree attends only to its ancestors. For a tree with branching factor $[3, 2]$ (3 children at depth 1, 2 grandchildren per child at depth 2), there are $3 + 6 = 9$ candidate positions, but each depth-2 node attends to only 3 prior tokens (root + its depth-1 parent + itself), not all 9.
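The ancestor-mask construction for the $[3, 2]$ example can be sketched as follows (node numbering is illustrative; node 0 stands in for the root/last verified token):

```python
import numpy as np

def ancestor_mask(parents):
    """mask[i, j] is True iff candidate i may attend to j (an ancestor or itself)."""
    n = len(parents)
    mask = np.eye(n, dtype=bool)
    for i in range(n):
        j = parents[i]
        while j is not None:       # walk up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Branching [3, 2]: node 0 = root, nodes 1-3 = depth-1 drafts, nodes 4-9 = depth-2 drafts.
parents = [None, 0, 0, 0, 1, 1, 2, 2, 3, 3]
mask = ancestor_mask(parents)
```

Each depth-2 row has exactly 3 True entries (root + parent + itself), matching the text; the prompt prefix would be all-True columns prepended to this block.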
vLLM tree implementation. The EagleProposer (in vllm/v1/spec_decode/eagle.py) parses speculative_token_tree into tree_choices -- a list of tuples defining the tree topology (line 275). It precomputes cu_drafts_per_level (cumulative node counts per depth) and child_drafts_per_level (branching factor per level) at lines 278-289. Tree attention is handled by TreeAttentionMetadataBuilder which constructs the causal mask restricting each query to its ancestor path.
EAGLE: Draft from Target Hidden States
EAGLE avoids training or loading a separate draft model entirely. The key observation: the target model's last hidden state at position $i$ already encodes rich information about token $i+1$. An EAGLE draft head takes this hidden state, applies a lightweight transformation (typically 1-2 transformer layers), and predicts the next token.
EAGLE3 extends this by incorporating auxiliary hidden states from intermediate layers, improving prediction quality at the cost of slightly more draft computation.
SGLang's EAGLE implementation is the most mature in the ecosystem:
EAGLEWorker (sglang/srt/speculative/eagle_worker.py): The V1 worker. Extends TpModelWorker, shares the token-to-KV-pool allocator with the target worker (line 115-117). Initializes the draft model with the target's embedding and LM head via get_embed_and_head() (line 158). Supports both EAGLE and EAGLE3 (lines 120-176), with EAGLE3 models sometimes sharing the LM head from the target.
EAGLEWorkerV2 (sglang/srt/speculative/eagle_worker_v2.py): Overlaps draft generation with target verification. While the target verifies batch $t$, the draft model speculatively generates candidates for batch $t+1$ using the predicted hidden states. This pipelining hides the draft model's latency almost entirely.
EAGLEDraftCudaGraphRunner (sglang/srt/speculative/eagle_draft_cuda_graph_runner.py): CUDA graph capture for the draft model's forward pass, eliminating kernel launch overhead that would otherwise dominate the lightweight draft computation.
organize_draft_results() in eagle_utils.py (line 19): Takes per-step score, token, and parent lists, flattens and sorts by top-K scores to select the best num_draft_token - 1 candidates across all tree branches.
build_tree_kernel_efficient() (line 47): Constructs the verification tree mask. Supports three modes via TreeMaskMode: FULL_MASK (dense boolean), QLEN_ONLY (query-length dimension only), and QLEN_ONLY_BITPACKING (packed bits for memory efficiency with large trees).
Multi-Token Prediction (MTP / Parallel Drafting)
MTP adds multiple prediction heads branching off the final transformer layers. Head $i$ predicts the token at position $+i$ relative to the current position. All heads execute in a single forward pass, producing $K$ candidate tokens simultaneously without any autoregressive draft steps.
vLLM's parallel drafting. The SpecDecodeBaseProposer (line 87-96 of eagle.py) sets parallel_drafting=True and computes extra_slots_per_request = num_speculative_tokens (each request needs $K$ additional KV cache slots for the draft positions). The _init_parallel_drafting_params() method (line 317) reads a mask token ID from the draft model config -- this padding token fills positions where the parallel head has no meaningful input, used in models like DFlash that natively support this mode.
Comparison
Dimension
vLLM
SGLang
Primary method
EAGLE + tree attention
EAGLE + tree via sgl_kernel
Rejection sampling
Triton kernels (strict, probabilistic, synthetic)
Custom CUDA verification in build_tree_kernel_efficient
When vLLM wins: Probabilistic rejection sampling for quality-sensitive workloads; MTP/parallel drafting for models that natively support it; broader hardware backend for spec decode (ROCm).
When SGLang wins: EAGLE V2 overlap hides draft latency nearly for free; the bitpacked tree mask scales to larger trees with less memory; tighter integration between draft and target through shared KV pools.
Quiz: Test Your Understanding
Q1 (Calculation). For a linear speculation chain with $K=5$ draft tokens and per-token acceptance rate $\alpha=0.8$, compute the expected number of accepted tokens (excluding the bonus token from target sampling at the rejection point).
$E = \sum_{i=1}^{5} 0.8^i = 0.8 \times (1 - 0.8^5)/(1 - 0.8) \approx 2.69$ accepted draft tokens. Including the one extra token the target emits each step (the correction sample at the first rejected position, or a bonus sample on full acceptance), that is approximately 3.69 tokens per verification step.
Q2 (Concept). In probabilistic rejection sampling, why do we sample from $r(x) \propto \max(0, p(x) - q(x))$ when the draft token is rejected, rather than simply sampling from $p(x)$?
If we sampled from $p(x)$ on rejection, we would double-count tokens where $p(x) > q(x)$: they already had a chance of being accepted in the acceptance step. The residual distribution $r(x) \propto \max(0, p(x) - q(x))$ compensates for the probability mass already "consumed" by the acceptance step. This ensures the marginal distribution over emitted tokens is exactly $p(x)$ -- the target distribution -- regardless of how poor the draft model is.
Q3 (Calculation). A tree speculation setup uses branching factor $[3, 2, 2]$ (3 children at depth 1, 2 at depth 2, 2 at depth 3). How many total candidate tokens are verified? If each edge has independent acceptance probability $\alpha=0.7$, what is the probability that at least one path of depth 3 is fully accepted?
Total candidates: $3 + 3 \times 2 + 3 \times 2 \times 2 = 3 + 6 + 12 = 21$ tokens, all verified in a single forward pass. Each depth-3 path requires 3 consecutive acceptances: $P(\text{one path}) = 0.7^3 = 0.343$. There are $3 \times 2 \times 2 = 12$ such paths. If the paths were independent, $P(\text{no path accepted}) = (1 - 0.343)^{12} = 0.657^{12} \approx 0.0065$, giving $P(\text{at least one}) \approx 0.9935$. However, the paths are NOT independent -- they share prefix edges, and a single rejection near the root eliminates many paths at once, so the true probability is lower than this naive bound.
Exact calculation accounting for shared prefixes. Work bottom-up through the tree. A subtree "fails" if the root edge is rejected OR the root edge is accepted but all child subtrees fail:
Depth-2 subtree (depth-2 node to depth 3). Each depth-2 node has 2 children. A depth-2 subtree fails if the edge to the depth-2 node is rejected, or the edge is accepted but both depth-3 children are rejected:
$P(\text{fail}_2) = (1 - 0.7) + 0.7 \times (1 - 0.7)^2 = 0.3 + 0.7 \times 0.09 = 0.363$
Depth-1 subtree (depth-1 node to depth 3). Each depth-1 node has 2 depth-2 children. A depth-1 subtree fails if the edge to the depth-1 node is rejected, or the edge is accepted but both depth-2 subtrees fail:
$P(\text{fail}_1) = (1 - 0.7) + 0.7 \times 0.363^2 = 0.3 + 0.7 \times 0.1318 \approx 0.392$
Root to all 3 subtrees. The 3 depth-1 subtrees are independent (no shared edges):
$P(\text{all 3 fail}) = 0.392^3 \approx 0.060$.
Corrected answer: $P(\text{at least one depth-3 path fully accepted}) = 1 - 0.060 \approx \mathbf{0.940}$.
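The bottom-up recurrence generalizes to any branching profile (a sketch under the same independent-edge-acceptance assumption):

```python
def p_no_full_path(alpha, branching):
    """P(no root-to-leaf path is fully accepted), computed bottom-up so that
    shared prefix edges are counted once (unlike the naive independence bound)."""
    def subtree_fail(depth):
        # Fail prob of a subtree rooted at a node at `depth`, incoming edge included.
        if depth == len(branching):
            return 1.0 - alpha                 # deepest node: fails iff its edge is rejected
        return (1.0 - alpha) + alpha * subtree_fail(depth + 1) ** branching[depth]
    return subtree_fail(1) ** branching[0]     # depth-1 subtrees share no edges

success = 1 - p_no_full_path(0.7, [3, 2, 2])   # ~0.940, matching the worked answer
```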
Q4 (Design). Explain why EAGLE V2's overlap strategy does not violate the data dependency between draft and verify phases.
V2 overlaps draft generation for batch $t+1$ with target verification of batch $t$. The draft for $t+1$ uses the *predicted* hidden states from the draft model (not the verified hidden states from the target). When verification of batch $t$ completes and some draft tokens are rejected, V2 discards the stale draft candidates for those requests and regenerates them. The overlap is speculative at the draft level as well -- if the hidden state prediction is good (which EAGLE's design optimizes for), most draft work is retained. The target model's output is never affected by the overlap.
11.
Memory / Storage Hierarchy
First Principles
The memory wall dictates everything about LLM serving economics. An H100 has 80 GB of HBM3 at 3.35 TB/s. A 70B parameter model in FP16 occupies 140 GB -- it does not fit on a single GPU. With TP=2, each GPU holds ~70 GB of weights, leaving roughly 10 GB for KV cache, activations, and framework overhead.
KV cache budget calculation. For a model with $L=80$ layers, $H=8$ KV heads (GQA), head dimension $d=128$, FP16 (2 bytes):
Per-token KV: $2 \times L \times H \times d \times 2 = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680$ bytes $\approx 320$ KB.
With 7 GB available: $7 \times 1024^3 / 327{,}680 \approx 22{,}900$ tokens of KV cache total across all active requests.
At 4K context length, that supports only ~5 concurrent requests. This is why tiered memory matters: moving cold KV entries to CPU or storage can 10x the effective cache capacity.
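The budget arithmetic as a reusable helper (illustrative, matching the figures above):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV footprint: K and V (factor 2), per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(80, 8, 128)     # 327,680 bytes ~= 320 KB
budget_tokens = 7 * 1024**3 // per_token       # tokens that fit in 7 GB of KV budget
concurrent_4k = budget_tokens // 4096          # concurrent requests at 4K context
```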
Design Space
Decision
Option A
Option B
Trade-off
Number of tiers
2 (GPU + CPU)
3+ (GPU + CPU + storage)
More tiers = more capacity but higher complexity + latency for cold hits
Transfer mechanism
Synchronous cudaMemcpy
Async JIT CUDA kernels
Async avoids blocking compute stream
Eviction policy
LRU
Hit-count threshold
Threshold avoids thrashing with bursty access patterns
CPU memory type
Pageable
Pinned (page-locked)
Pinned = faster DMA but reduces host memory available to OS
GPU copy engines (DMA engines) and compute-copy overlap. GPU copy engines are dedicated hardware units separate from compute SMs. They handle memory transfers (host-to-device, device-to-host, device-to-device) independently, meaning the SMs can execute kernels while a copy engine simultaneously transfers data. cudaMemcpyAsync on a non-default stream enables this true compute-copy overlap: the SMs run inference kernels on the compute stream while the copy engine transfers KV cache blocks on a separate transfer stream.
Pinned (page-locked) host memory is required for async DMA transfers because copy engines perform physical DMA and cannot handle virtual memory page faults. If the source/destination is pageable memory, the CUDA driver must first copy it to an internal pinned staging buffer before the DMA engine can transfer it -- effectively doubling the transfer time and serializing the operation. This is why both vLLM's CuMemAllocator (which backs up GPU memory to CPU tensors during sleep mode) and SGLang's HostKVCache use pin_memory=True for their CPU-side buffers: it ensures that KV cache offloading and loading can overlap with GPU compute without stalling either the SMs or the copy engine.
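The cost of the staging copy is easy to model. A first-order sketch with assumed bandwidth numbers (PCIe ~32 GB/s effective DMA, host memcpy ~25 GB/s — illustrative, not measured):

```python
def transfer_time_us(nbytes: int, dma_gbps: float, pinned: bool,
                     host_copy_gbps: float = 25.0) -> float:
    """Host->device transfer time. Pageable memory forces the driver to
    first copy into a pinned staging buffer, serialized before the DMA."""
    dma = nbytes / (dma_gbps * 1e9) * 1e6            # DMA engine time (us)
    if pinned:
        return dma
    staging = nbytes / (host_copy_gbps * 1e9) * 1e6  # CPU memcpy to staging
    return staging + dma

block = 1 << 20  # a 1 MiB batch of KV blocks
fast = transfer_time_us(block, dma_gbps=32.0, pinned=True)   # ~33 us
slow = transfer_time_us(block, dma_gbps=32.0, pinned=False)  # ~75 us
assert slow / fast > 2   # pageable more than doubles the effective time
```

And the pageable path cannot overlap with compute: the staging memcpy occupies a CPU core and serializes the operation, which is the second half of the penalty this model does not even capture.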
vLLM
vLLM implements a two-tier system: GPU HBM and CPU pinned memory.
CuMemAllocator (vllm/device_allocator/cumem.py): A custom CUDA memory allocator built on cuMemCreate/cuMemMap virtual memory APIs. It intercepts PyTorch's allocation path via CUDAPluggableAllocator (line 71), routing through my_malloc/my_free C entry points.
The key feature is sleep/wake mode: when a vLLM instance is idle (e.g., waiting for requests in a multi-tenant deployment), sleep() unmaps GPU memory and backs it up to CPU tensors (AllocationData.cpu_backup_tensor, line 55). wake() re-creates GPU allocations and restores from the CPU backup. This enables GPU memory sharing across multiple model instances without reloading weights from disk.
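In pure-Python terms, the lifecycle looks like this toy model (illustrative only — the real allocator unmaps CUDA virtual-memory ranges rather than moving Python buffers):

```python
class SleepableAllocation:
    """Toy sleep/wake: device storage is dropped on sleep, its contents
    parked in a host-side backup, then restored on wake."""
    def __init__(self, data: bytes):
        self.gpu = bytearray(data)   # stands in for a mapped HBM region
        self.cpu_backup = None

    def sleep(self) -> None:
        self.cpu_backup = bytes(self.gpu)  # back up to (pinned) host memory
        self.gpu = None                    # unmap / release device memory

    def wake(self) -> None:
        self.gpu = bytearray(self.cpu_backup)  # remap and restore contents
        self.cpu_backup = None

alloc = SleepableAllocation(b"model weights")
alloc.sleep()                    # HBM freed for another tenant
assert alloc.gpu is None
alloc.wake()                     # restored without touching disk
assert bytes(alloc.gpu) == b"model weights"
```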
SimpleCPUOffloadConnector: Manages explicit KV cache offloading to CPU pinned memory. Straightforward two-tier design -- blocks are either on GPU or CPU, with synchronous transfer on cache miss.
SGLang's Approach: HiCache (Hierarchical Cache)
SGLang implements a multi-tier caching system with three or more levels.
Tier 1: GPU (hot). Standard paged KV cache in HBM. Managed by the radix tree prefix cache.
Tier 2: CPU pinned (warm). HostKVCache (sglang/srt/mem_cache/memory_pool_host.py). Pinned memory pool for recently evicted KV entries.
Tier 3: Storage backends (cold). HiCacheStorageConfig and the abstract HiCacheStorage base class (sglang/srt/mem_cache/hicache_storage.py) define a pluggable storage interface. Concrete backends include:
NIXL (hicache_nixl.py): RDMA-based transfer for multi-node KV sharing.
Mooncake: Object store backend for persistent KV across restarts.
LMCache: External caching service integration.
Async transfer via JIT CUDA kernels. transfer_hicache_one_layer() in sglang/jit_kernel/hicache.py (line 73) is a JIT-compiled CUDA kernel for layer-granularity KV transfer. The kernel is parameterized by element_size, unroll factor, and block_quota (how many CUDA blocks to use -- lower values reduce interference with compute kernels). It supports both standard KV layout and MLA (Multi-head Latent Attention) layout via separate run_one_mla / run_all_mla entry points.
Tier migration policies. In HiRadixCache (sglang/srt/mem_cache/hiradix_cache.py):
write_through_threshold (default 1 for write-through, 2 for write-back): A node is backed up to host memory only after hit_count >= write_through_threshold (line 667). This avoids wasting bandwidth on one-shot prefills.
load_back_threshold (default 10): Host-to-GPU transfer only triggers if the number of pages to load exceeds this threshold (line 920), batching small loads to amortize transfer overhead.
batch_exists_v2() in HiCacheStorage checks prefix hits across all tiers simultaneously, using PoolHitPolicy (lines 46-54) to handle different pool semantics (e.g., Mamba states only need trailing pages, not the full prefix).
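The two thresholds compose into a simple admission-and-batching policy. A sketch — only the threshold names mirror SGLang's; the Node class and helper functions are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Node:
    hit_count: int = 0
    on_host: bool = False

def maybe_write_through(node: Node, write_through_threshold: int = 2) -> bool:
    # Admit to the host tier only once the entry has demonstrated reuse.
    if not node.on_host and node.hit_count >= write_through_threshold:
        node.on_host = True
        return True
    return False

def should_load_back(num_pages: int, load_back_threshold: int = 10) -> bool:
    # Batch small host->GPU loads to amortize per-transfer overhead.
    return num_pages > load_back_threshold

one_shot, reused = Node(hit_count=1), Node(hit_count=3)
assert not maybe_write_through(one_shot)  # unique prefill: skip the backup
assert maybe_write_through(reused)        # reused prefix: offload to host
assert not should_load_back(4) and should_load_back(16)
```

The admission side is the same idea as frequency-based cache admission (cf. TinyLFU): spend PCIe bandwidth only on entries with demonstrated reuse value.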
Comparison
Dimension
vLLM
SGLang
Tiers
2 (GPU + CPU)
3+ (GPU + CPU + storage backends)
GPU memory management
Custom cuMemCreate/cuMemMap allocator
Standard PyTorch + JIT transfer kernels
Sleep/wake
Yes (unmap GPU, backup to CPU)
No equivalent
Storage backends
None
NIXL (RDMA), Mooncake, LMCache, HF3FS
Transfer granularity
Block-level sync
Layer-level async JIT kernels
Migration policy
Simple LRU eviction
Hit-count thresholds + batch-size gating
Emerging: CXL memory pooling. CXL 3.0 enables cache-coherent memory sharing across hosts with load/store semantics, potentially replacing RDMA-based KV transfer with simpler shared-memory access. Neither framework supports CXL today, but it represents a natural evolution for multi-node KV caching.
Quiz: Test Your Understanding
Q1 (Calculation). You have an 80 GB H100 GPU running a 70B model with TP=2. Model weights consume 70 GB per GPU, activations take 2 GB, framework overhead is 1 GB. The model has 80 layers, 8 GQA KV heads, head dim 128, FP16 KV. How many tokens of KV cache fit? How does switching to FP8 KV change this?
Available KV budget: $80 - 70 - 2 - 1 = 7$ GB. Per-token KV is $2 \times 80 \times 8 \times 128 \times 2 = 327{,}680$ bytes ($\approx 320$ KB), so $7 \times 1024^3 / 327{,}680 \approx 22{,}900$ tokens fit. Switching to FP8 KV (1 byte per element) halves the per-token footprint to ~160 KB, roughly doubling capacity to ~45,900 tokens.
Q2 (Concept). Why does SGLang's write_through_threshold=2 (write-back mode) make sense for serving workloads with many unique prefills but few cache hits?
If most prefills have unique prompts, their KV entries will be evicted before being reused. Writing them to CPU on first access wastes PCIe bandwidth. By requiring hit_count >= 2, only entries that demonstrate reuse value are backed up. This is analogous to a frequency-based cache admission policy (like TinyLFU) rather than a pure write-through that admits everything.
12.
Observability + Control Plane
First Principles
An inference server that cannot be observed cannot be operated. The core challenge: LLM serving has multiple latency components that interact nonlinearly. A queue depth of 10 might be fine at short context lengths but catastrophic at 32K contexts because each request consumes 10x more KV cache, triggering preemption cascades. Without per-component metrics, operators cannot distinguish between "GPU compute bound" and "memory thrashing."
Key Metrics and What They Reveal
Metric
What it measures
Why it matters
TTFT
Time from request arrival to first output token
Dominated by prefill time + queue wait. User-perceived responsiveness.
KV cache utilization
Fraction of KV cache blocks currently allocated
Memory pressure indicator. At >90%, preemption starts.
Queue depth
Number of waiting requests
Leading indicator of latency spikes.
Batch size (running)
Requests currently in a decode batch
Higher = better GPU utilization, but each addition increases per-iteration latency.
The "healthy but degraded" trap. A server can be live (process running), ready (model loaded), and even pass shallow health checks while delivering terrible latency because its KV cache is thrashing. Deep health checks that perform end-to-end generation are the only way to catch this, at the cost of consuming GPU cycles.
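A deep health check is conceptually just a timed end-to-end generation. A minimal sketch — the `generate` callable and its signature are hypothetical stand-ins for whatever client the probe uses:

```python
import time

def deep_health_check(generate, budget_s: float = 2.0) -> bool:
    """Pass only if a real one-token generation completes within budget."""
    start = time.monotonic()
    try:
        token = generate("ping", max_tokens=1)
    except Exception:
        return False                      # worker crashed or raised
    return token is not None and time.monotonic() - start < budget_s

def healthy(prompt, max_tokens):
    return "ok"

def thrashing(prompt, max_tokens):
    time.sleep(0.1)                       # simulates KV-cache thrashing
    return "ok"

assert deep_health_check(healthy)
assert not deep_health_check(thrashing, budget_s=0.05)
```

Run it at a low frequency (it consumes real GPU cycles) and alongside, not instead of, cheap liveness/readiness probes.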
vLLM
Prometheus metrics (vllm/v1/metrics/prometheus.py): Sets up multiprocess-safe Prometheus export via setup_multiprocess_prometheus() with a temporary directory for the PROMETHEUS_MULTIPROC_DIR multiprocess collector (line 25). Exposes histograms for TTFT, TPOT, and counters for tokens generated, requests completed.
ORCA headers: Load-balancer integration headers that expose real-time capacity information (queue depth, running batch size) in HTTP response headers, allowing L7 load balancers to make routing decisions without polling a separate metrics endpoint.
OpenTelemetry tracing: Distributed tracing support for tracking request lifecycle across API gateway, scheduler, and model workers.
RequestLogger: Per-request structured logging with timing breakdowns.
SGLang
Prometheus middleware (add_prometheus_middleware in sglang/srt/entrypoints/http_server.py): FastAPI middleware that wraps each request with Prometheus timing and counter instrumentation.
Custom tracing (sglang/srt/observability/trace.py): SGLang's own tracing framework for fine-grained scheduler-level event tracking.
Watchdog (sglang/srt/utils/watchdog.py): A factory-pattern monitor created via Watchdog.create() (base class at line 19), which returns a _WatchdogReal instance (line 47). The watchdog runs a background thread (not a subprocess) that detects hung workers. If a worker fails to heartbeat within a configurable timeout, the watchdog kills it and triggers recovery. This is critical for production deployments where NCCL hangs or GPU errors can silently deadlock a worker.
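The heartbeat mechanics reduce to a timestamp comparison. A minimal poll-based sketch (the real watchdog runs the check in a background thread and kills the worker on expiry; this class is illustrative, not SGLang's API):

```python
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        # Called by the worker after every scheduler/forward iteration.
        self.last_beat = time.monotonic()

    def is_stale(self) -> bool:
        # Called periodically by the monitor; True means "kill and recover".
        return time.monotonic() - self.last_beat > self.timeout_s

mon = HeartbeatMonitor(timeout_s=0.05)
mon.beat()
assert not mon.is_stale()    # worker is making progress
time.sleep(0.1)              # worker hangs (e.g., a stuck NCCL all-reduce)
assert mon.is_stale()        # watchdog would now restart the worker
```

The key design point is that the heartbeat is tied to per-iteration progress, not process existence — which is exactly what liveness probes miss.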
PyTorch profiler integration: Built-in hooks for capturing torch.profiler traces for kernel-level performance analysis.
Expert distribution recording: For MoE models, tracks which experts are activated and how load is distributed -- essential for diagnosing MoE load imbalance.
Comparison
Dimension
vLLM
SGLang
Metrics export
Prometheus (multiprocess-safe)
Prometheus (FastAPI middleware)
Load balancer integration
ORCA headers
Not built-in
Distributed tracing
OpenTelemetry
Custom trace framework
Failure detection
Process-level
Watchdog (background thread) with heartbeat
Profiling
External tooling
Built-in PyTorch profiler hooks
MoE-specific
No
Expert distribution recording
Fault tolerance gap. Neither framework handles GPU failure gracefully during active generation. A TP rank failure is unrecoverable -- the entire TP group must restart, losing all in-flight requests. PP stage failures cause request timeouts. In disaggregated mode, a prefill-instance failure loses uncommitted KV cache, requiring re-prefill. At 1000+ GPU scale (where per-GPU MTBF is days), this is a significant production concern. Request-level checkpointing and redundant expert placement for MoE are open problems.
Quiz: Test Your Understanding
Q1 (Concept). A serving cluster shows TTFT p50 = 200ms but p99 = 8s. KV cache utilization averages 60%. What is the most likely cause, and which additional metric would confirm it?
The 40x gap between p50 and p99 TTFT with moderate average KV utilization suggests bursty arrivals causing queue buildup. During bursts, requests queue behind long-context prefills that temporarily spike KV usage and block scheduling. The confirming metric is queue depth over time (not average, but p99 queue depth). A time-series showing queue depth spikes correlating with p99 TTFT spikes would confirm. Additionally, checking the prefill token count distribution would reveal whether a few long-context requests are monopolizing prefill time.
Q2 (Concept). Why is a liveness probe insufficient for production LLM serving? Describe a failure mode that passes liveness but causes complete service degradation.
An NCCL deadlock on one GPU in a TP group: the process is alive (liveness passes), the HTTP server responds to health checks (readiness passes), but any request requiring GPU computation hangs indefinitely because the all-reduce in the attention layer never completes. The only detection methods are (1) deep health checks that run an actual generation request, or (2) a watchdog that monitors per-iteration heartbeats (SGLang's Watchdog approach).
13.
Hardware / Fabric Layer
First Principles
The inference framework must map its logical parallelism strategies onto physical hardware topology. A suboptimal mapping can degrade throughput by an order of magnitude -- for example, running tensor parallelism over PCIe instead of NVLink increases all-reduce latency by ~14x, turning a compute-bound workload into a communication-bound one.
GPU Memory Hierarchy (Key Numbers for H100 SXM)
Level
Capacity
Bandwidth
Latency
Role in LLM Serving
Register file
~256 KB/SM (132 SMs)
N/A (same cycle)
0 cycles
Accumulator registers for matmul, softmax
L1 / SMEM
192 KB/SM (configurable)
~30 TB/s aggregate
~30 cycles
FlashAttention tile storage, reduction scratch
L2 cache
50 MB
~12 TB/s
~200 cycles
KV cache page hot set, weight residency for small layers
HBM3
80 GB
3.35 TB/s
~400 cycles
Model weights, KV cache, activation buffers
PCIe 5.0
Host memory
64 GB/s
~1 us
CPU offload, host KV cache tier
NVLink 4.0
Peer GPU memory
900 GB/s bidir
~1 us
Tensor parallel all-reduce within node
InfiniBand NDR
Remote node
50 GB/s per port (400 Gb/s); DGX H100 has 8x NDR ports = 400 GB/s aggregate per node
~1-5 us
Pipeline parallel activation transfer across nodes
Topology Impact on Parallelism Decisions
Tensor parallelism (TP) requires an all-reduce after every attention and FFN layer. For a hidden dimension $H=8192$ in FP16, the all-reduce message is $H \times 2 = 16$ KB (for a single token in decode). With batching, this becomes $B \times H \times 2$ bytes.
NVLink vs PCIe all-reduce time for a single decode step ($B=64$, $H=8192$, FP16, TP=2): the message is $64 \times 8192 \times 2 = 1$ MiB. At ~50% effective bandwidth, NVLink (450 GB/s) moves it in ~2.3 us per layer; PCIe 5.0 (32 GB/s) takes ~32 us.
Per-layer overhead difference: $\sim29\;\mu s$. Over 80 layers: $2.3$ ms additional per decode step on PCIe.
For prefill with $T=2048$ tokens: message = $2048 \times 8192 \times 2 = 32$ MB. PCIe: 1 ms/layer. NVLink: 0.07 ms/layer. Over 80 layers: 80 ms vs 5.6 ms -- a 14x gap that makes PCIe TP impractical for latency-sensitive prefill.
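These figures fall out of a bandwidth-only model; rounding the effective bandwidths to 450 and 32 GB/s (~50% of peak) reproduces them to within a microsecond. The model deliberately ignores launch/sync latency, which is why it only holds for large messages:

```python
def xfer_us(nbytes: int, gbps: float) -> float:
    return nbytes / (gbps * 1e9) * 1e6

NVLINK, PCIE = 450.0, 32.0          # assumed effective GB/s (~50% of peak)

decode_msg = 64 * 8192 * 2          # B=64, H=8192, FP16: 1 MiB
gap = xfer_us(decode_msg, PCIE) - xfer_us(decode_msg, NVLINK)
print(round(gap))                   # ~30 us/layer; ~2.4 ms over 80 layers

prefill_msg = 2048 * 8192 * 2       # T=2048 tokens: 32 MiB
print(round(xfer_us(prefill_msg, PCIE) / xfer_us(prefill_msg, NVLINK)))  # 14
```

The ratio is just the bandwidth ratio (450/32 ≈ 14), which is why the 14x gap shows up at any message size large enough to be bandwidth-bound.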
Consequence. TP should stay within the NVLink domain (typically 4 or 8 GPUs in one node). Cross-node parallelism should use PP (pipeline parallelism), which transfers only activation tensors at pipeline stage boundaries -- much smaller and less frequent messages.
NVLink vs NVSwitch: point-to-point vs full crossbar. NVLink provides point-to-point connections between GPU pairs. Each NVLink 4.0 link delivers 25 GB/s per direction (50 GB/s bidirectional), and H100 GPUs have 18 links total, but these must be distributed across connected peers. NVSwitch is a fundamentally different component: a non-blocking crossbar switch that connects ALL GPUs in a node simultaneously. DGX H100 uses 4 third-generation NVSwitch chips to create a fully-connected 8-GPU fabric where every GPU has 900 GB/s bidirectional bandwidth to every other GPU concurrently.
This distinction is critical for communication patterns like MoE all-to-all expert dispatch. With NVSwitch, all 8 GPUs can send tokens to all other GPUs simultaneously at full bandwidth (full bisection bandwidth). With only NVLink point-to-point links (no NVSwitch), multi-hop routing would be required for non-adjacent GPU pairs, reducing effective bandwidth and increasing latency. The DGX H100 node topology combines: 8x H100 SXM GPUs + 4x NVSwitch chips (providing the all-to-all GPU fabric) + 8x ConnectX-7 400G InfiniBand NICs (providing 400 GB/s aggregate inter-node bandwidth).
vLLM
PyNVML topology detection: CudaPlatform.is_fully_connected() (vllm/platforms/cuda.py, line 631) uses NVML to query NVLink connectivity between physical device IDs. Returns True only if every pair of GPUs has a direct NVLink connection (1-hop). This gates whether custom all-reduce (optimized for NVLink) or NCCL fallback (handles PCIe + NVSwitch) is used.
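The gating logic reduces to a pairwise reachability check. A framework-agnostic sketch over an adjacency matrix — the NVML querying itself is elided, and the function here is an illustration of the check, not vLLM's code:

```python
from itertools import combinations

def is_fully_connected(nvlink: list[list[bool]]) -> bool:
    """True iff every GPU pair has a direct (1-hop) NVLink connection.
    Input is a symmetric adjacency matrix built from NVML link queries."""
    n = len(nvlink)
    return all(nvlink[i][j] for i, j in combinations(range(n), 2))

# 4 GPUs behind NVSwitch: every pair is directly connected.
full = [[i != j for j in range(4)] for i in range(4)]
# Two NVLink islands of 2 GPUs: cross-island pairs would need multi-hop.
split = [[(i // 2 == j // 2) and i != j for j in range(4)] for i in range(4)]

assert is_fully_connected(full)       # take the custom all-reduce path
assert not is_fully_connected(split)  # fall back to NCCL
```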
Custom all-reduce: custom_all_reduce.py and quick_all_reduce.py implement NVLink-optimized all-reduce that bypasses NCCL overhead for small messages (the decode case where message sizes are KB-scale and NCCL's setup latency dominates).
SGLang
Hardware backends: CUDA, ROCm, NPU (Ascend), MLX (Apple Silicon), MUSA (Moore Threads). Backend-specific graph runners in sglang/srt/hardware_backend/.
NUMA-aware process binding: For multi-socket CPU hosts, SGLang binds worker processes to the NUMA node closest to their assigned GPU, reducing cross-socket memory access latency for CPU-side operations (tokenization, scheduling).
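On Linux the binding itself is a one-liner once the GPU-to-NUMA mapping is known. A sketch with a hypothetical static topology (real code reads the mapping from sysfs or NVML):

```python
import os

# Hypothetical 2-socket host: GPUs 0-3 hang off NUMA node 0, GPUs 4-7 off node 1.
GPU_TO_NUMA = {g: 0 if g < 4 else 1 for g in range(8)}
NUMA_TO_CPUS = {0: set(range(0, 32)), 1: set(range(32, 64))}

def cpus_for_gpu(gpu_id: int) -> set[int]:
    return NUMA_TO_CPUS[GPU_TO_NUMA[gpu_id]]

def bind_worker(gpu_id: int) -> None:
    # Pin this process to the CPUs on the GPU's NUMA node (Linux only).
    os.sched_setaffinity(0, cpus_for_gpu(gpu_id))

assert cpus_for_gpu(2) == set(range(0, 32))
assert cpus_for_gpu(5) == set(range(32, 64))
```

After binding, all CPU-side work for that worker — tokenization, scheduling, KV bookkeeping — touches local memory only.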
Ascend NPU backend (sglang/srt/hardware_backend/npu/): Full graph runner implementations including EAGLE draft support (eagle_draft_npu_graph_runner.py), demonstrating SGLang's hardware abstraction -- the same speculation logic runs on NPU with backend-specific graph execution.
Comparison
Dimension
vLLM
SGLang
GPU platforms
CUDA, ROCm, XPU, TPU, CPU
CUDA, ROCm, NPU, MLX, MUSA
Topology detection
PyNVML is_fully_connected()
Relies on NCCL/torch.distributed
Custom all-reduce
Yes (NVLink-optimized)
Uses NCCL defaults
NUMA binding
No
Yes
NPU support
No (Ascend not supported)
Full Ascend NPU backend
Apple Silicon
No
MLX backend
Quiz: Test Your Understanding
Q1 (Calculation). Compute the all-reduce time for TP=2 with hidden dimension $H=8192$, FP16, batch size $B=1$ (single decode token) on NVLink 4.0 vs PCIe 5.0. Assume effective bandwidth is 50% of peak for both.
At the bandwidth level: the message is $H \times 2 = 16$ KB; at 50% effective bandwidth, NVLink (450 GB/s) takes ~0.04 us and PCIe 5.0 (32 GB/s) takes ~0.5 us. However, at 16 KB both are dominated by latency (kernel launch, synchronization), not bandwidth. Realistic times are ~2-5 us for NVLink and ~5-10 us for PCIe at this message size. The bandwidth advantage of NVLink only manifests for messages > ~100 KB.
Q2 (Concept). A deployment has 2 nodes, each with 4 GPUs connected by NVLink. No NVLink between nodes, only InfiniBand. You need to serve a model requiring 8-way parallelism. Compare TP=8 (all TP) vs TP=4+PP=2 (TP within node, PP across nodes) in terms of communication overhead.
TP=8: Every layer requires an all-reduce across all 8 GPUs. 4 of the 8 pairs cross the InfiniBand link (50 GB/s), not NVLink (900 GB/s). The all-reduce becomes bottlenecked by the slowest link. For a ring all-reduce, the cross-node segment limits throughput to ~50 GB/s for the entire operation.
TP=4 + PP=2: TP all-reduce stays within each node on NVLink (900 GB/s). PP only sends activation tensors at the pipeline boundary -- one message per micro-batch, size $B \times S \times H \times 2$ bytes. For $B=32, S=1, H=8192$: 512 KB over InfiniBand -- negligible. PP adds pipeline bubble overhead ($\sim 1/\text{num\_stages}$ fraction of compute is idle), but the communication savings from keeping TP on NVLink far outweigh this. TP=4+PP=2 is strongly preferred.
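The asymmetry is easy to quantify with a first-order bandwidth model (effective bandwidths assumed at ~50% of peak; the TP=8 term charges every layer's all-reduce at the InfiniBand rate, since the slowest link gates a ring all-reduce):

```python
def xfer_us(nbytes: int, gbps: float) -> float:
    return nbytes / (gbps * 1e9) * 1e6

IB, NVLINK = 25.0, 450.0       # assumed effective GB/s (~50% of peak)
H, layers, B = 8192, 80, 32

allreduce_msg = B * H * 2      # 512 KiB per-layer all-reduce payload

# TP=8: the cross-node segment gates every layer's all-reduce.
tp8 = layers * xfer_us(allreduce_msg, IB)

# TP=4+PP=2: all-reduces stay on NVLink; IB carries one activation
# tensor per micro-batch at the single stage boundary (B=32, S=1).
tp4pp2 = layers * xfer_us(allreduce_msg, NVLINK) + xfer_us(B * 1 * H * 2, IB)

print(round(tp8), round(tp4pp2))   # ~1678 us vs ~114 us per decode step
```

Even before accounting for pipeline bubbles, the hybrid layout wins by more than an order of magnitude on communication time — which is why "TP inside the NVLink domain, PP across nodes" is the standard prescription.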
Q3 (Concept). Why does SGLang's NUMA-aware process binding matter for inference serving but not for training?
In training, the GPU is busy with long-running matmuls for hundreds of milliseconds between CPU interactions, so cross-NUMA latency for the occasional CPU operation is negligible. In serving, the CPU is in the critical path for *every* request: tokenization, scheduling decisions, KV cache management, and HTTP I/O all happen on CPU between GPU forward passes. A decode step can take < 10ms, so a cross-NUMA memory access adding 100-200ns per cache line miss accumulates significantly when processing scheduling logic for hundreds of requests. Binding the scheduler to the same NUMA node as its GPU avoids this overhead.