Who this is for: Whether you're building your own inference engine, evaluating vLLM vs SGLang for production, or simply want to understand how modern LLM serving works under the hood — this guide walks through every layer from first principles. We go deep on GPU architecture and systems design, but every novel concept (paged attention, RadixAttention, continuous batching, speculative decoding, disaggregated serving) is explained from scratch so you can follow along regardless of background.
How to read: Each of the 13 sections covers one layer of the serving stack — starting with the problem it solves, then showing how vLLM and SGLang each tackle it, with code references and comparison tables. Quizzes at the end of each section let you check your understanding. Feel free to jump to any section that interests you.
Codebase: vLLM at bd8bd52 (2026-04-11), SGLang at 78043d4 (2026-04-11).
1. Front Door / API Layer
First Principles
An inference engine is, at its core, a GPU-bound pipeline: tokenize → schedule → compute → sample → detokenize. The front door's job is to accept HTTP requests, convert them into the engine's internal representation, and stream results back — all without becoming the bottleneck.
The critical insight is that the GPU is the scarce resource, not the CPU. A single H100 costs ~$30K and can process thousands of tokens per second, but only if it's kept busy. Every microsecond the GPU idles because the API layer is slow to enqueue the next request is wasted money. This creates a fundamental design tension:
Tokenization is CPU-bound and variable-latency. A 100K-token prompt takes orders of magnitude longer to tokenize than a 10-token prompt. If tokenization blocks the event loop, GPU scheduling stalls.
Streaming requires per-token delivery. Users expect Server-Sent Events (SSE) with sub-100ms inter-token latency. The API layer must multiplex thousands of concurrent SSE streams without head-of-line blocking.
Serialization overhead matters at scale. At 10K requests/second, the cost of JSON serialization, request validation, and inter-process marshalling becomes non-trivial. Binary protocols (msgspec, protobuf) outperform JSON by 5-10x for structured data.
Process isolation prevents GIL contention. Python's GIL means that tokenization, scheduling, and HTTP handling cannot truly parallelize within a single process. The solution is multi-process architecture with efficient IPC.
Design Space
An implementer faces these choices:
| Decision | Options | Trade-off |
|---|---|---|
| Web framework | FastAPI, aiohttp, Starlette, raw asyncio | FastAPI gives OpenAPI docs + validation but adds overhead |
| Event loop | uvloop, default asyncio | uvloop is 2-4x faster for I/O-bound workloads |
| IPC mechanism | ZMQ, gRPC, Unix pipes, shared memory | ZMQ: low-latency, no schema; gRPC: typed, higher overhead |
| Serialization | msgspec, protobuf, pickle, JSON | msgspec: zero-copy struct packing; pickle: flexible but slow |
| Process model | Single-process, 2-process, 3-process | More processes = better isolation, higher IPC cost |

vLLM
Key file: vllm/entrypoints/openai/api_server.py — the build_app() function (line 157) conditionally registers API routers based on supported_tasks. Each task type (generate, chat, transcription, realtime) has its own sub-router module.
Serialization: msgspec structs for EngineCoreRequest and EngineCoreOutput. msgspec is chosen because it supports zero-copy deserialization of typed structs — no schema negotiation, no reflection. The EngineCoreRequest (defined in vllm/v1/engine/__init__.py:80) carries prompt_token_ids, sampling_params, mm_features, arrival_time, priority, and other fields as fixed-layout binary.
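To make the fixed-layout idea concrete, here is a minimal sketch using Python's stdlib struct module rather than msgspec itself. The field set (request_id, arrival_time, priority, token ids) mirrors a subset of what EngineCoreRequest carries, but the exact layout below is illustrative, not vLLM's actual wire format.

```python
import struct

# Fixed-layout binary packing: no field names on the wire, no reflection,
# just bytes at known offsets. This is the idea behind msgspec's typed Structs.
HEADER = struct.Struct("<QdiI")  # request_id, arrival_time, priority, num_token_ids

def encode_request(request_id: int, arrival_time: float, priority: int,
                   token_ids: list[int]) -> bytes:
    header = HEADER.pack(request_id, arrival_time, priority, len(token_ids))
    body = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return header + body

def decode_request(buf: bytes):
    request_id, arrival_time, priority, n = HEADER.unpack_from(buf, 0)
    token_ids = list(struct.unpack_from(f"<{n}I", buf, HEADER.size))
    return request_id, arrival_time, priority, token_ids

msg = encode_request(42, 1700000000.5, 1, [101, 2009, 2003])
assert decode_request(msg) == (42, 1700000000.5, 1, [101, 2009, 2003])
```

Because field offsets are known at build time, decoding is a handful of pointer reads; no key strings are parsed, which is where JSON spends most of its time.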
Streaming implementation: The OutputProcessor (vllm/v1/engine/output_processor.py:45) uses RequestOutputCollector with asyncio.Event for non-blocking put/get. If the producer (engine) gets ahead of the consumer (HTTP stream), outputs are delta-merged to avoid queue bloat. This is important: a fast GPU can produce tokens faster than a slow client can consume them.
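A stripped-down sketch of the delta-merge pattern (names and types here are illustrative; vLLM's RequestOutputCollector merges full RequestOutput objects, not token lists):

```python
import asyncio

# put() never blocks: if the consumer hasn't drained the previous output,
# new token deltas are folded into it, so queue depth stays at 1.
class OutputCollector:
    def __init__(self):
        self.output = None
        self.ready = asyncio.Event()

    def put(self, delta: list[str]) -> None:
        if self.output is None:
            self.output = list(delta)
        else:
            self.output.extend(delta)   # merge instead of queueing a new event
        self.ready.set()

    async def get(self) -> list[str]:
        await self.ready.wait()
        out, self.output = self.output, None
        self.ready.clear()
        return out

async def demo():
    c = OutputCollector()
    c.put(["the"]); c.put([" quick"]); c.put([" brown"])  # producer runs ahead
    return await c.get()   # consumer sees one merged delta

print(asyncio.run(demo()))  # → ['the', ' quick', ' brown']
```

The fast-producer/slow-consumer case collapses three pending deltas into a single SSE event, bounding memory regardless of how far the GPU runs ahead.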
Authentication: Simple bearer-token via AuthenticationMiddleware (api_server.py:273). Set via --api-key or VLLM_API_KEY. No OAuth, JWT, or per-tenant auth.
Rate limiting: None built-in. The server accepts --middleware args for user-supplied middleware classes (line 302-312), so rate limiting is an external concern.
SGLang
Architecture: 3-process model with ZMQ IPC
SGLang separates into three process types:
HTTP server + TokenizerManager: FastAPI + uvicorn. Tokenizes requests and forwards to scheduler.
Scheduler: Owns the GPU worker, runs the scheduling loop, executes model forward passes.
DetokenizerManager: Converts output token IDs back to text, formats responses.
This 3-way split isolates tokenization and detokenization (both CPU-bound) from GPU scheduling (latency-critical). The scheduler process never blocks on string operations.
Key file: sglang/srt/entrypoints/http_server.py — all route registration, lifespan setup, and process orchestration.
Each API endpoint family has its own serving handler that normalizes requests to internal GenerateReqInput structs (sglang/srt/managers/io_struct.py).
Multi-tokenizer mode: When --tokenizer-worker-num > 1, SGLang spawns multiple uvicorn worker processes sharing state via shared memory (read_from_shared_memory() at http_server.py:243). Each worker gets its own ZMQ IPC socket. This scales tokenization throughput linearly for high-QPS workloads.
HTTP/2 support: Optional via Granian server (--enable-http2). HTTP/2 multiplexing reduces connection overhead for many concurrent streams.
Authentication: Three auth levels via sglang/srt/utils/auth.py: NORMAL, ADMIN_OPTIONAL, ADMIN_FORCE. Admin endpoints (weight updates, profiling, cache flush) are gated separately from inference endpoints. API key set via --api-key and --admin-api-key.
Health check depth: SGLang's /health endpoint (http_server.py:506-578) sends a real generation request with max_new_tokens=1 to verify the full pipeline is functional. This is a deep health check — it validates tokenization, scheduling, GPU compute, and detokenization. Most frameworks only check process liveness.
IPC implementation detail: Both frameworks use ZMQ PUSH/PULL for request flow (one-directional, non-blocking) and PUB/SUB for stats broadcasting. SGLang additionally uses DEALER/ROUTER for the disaggregated P/D bootstrap handshake. HWM (high-water mark) settings control backpressure -- when the send HWM is reached, ZMQ drops or blocks depending on socket type.
When vLLM wins: Simpler deployment (fewer processes), lower IPC overhead for single-GPU setups, gRPC for internal service meshes.
When SGLang wins: High-QPS scenarios needing tokenizer parallelism, Ollama compatibility for local dev, deep health checks for Kubernetes readiness probes, admin API separation for multi-tenant platforms.
Quiz: Test Your Understanding
Q1 (Concept): Why do both frameworks use ZMQ rather than gRPC for internal IPC between the API process and the scheduler?
ZMQ is a lightweight message-passing library with no schema negotiation, no HTTP overhead, and sub-microsecond latency for intra-host communication. gRPC adds protobuf serialization, HTTP/2 framing, and connection management — overhead that's unnecessary when both endpoints are on the same machine. For internal IPC, the bottleneck is serialization cost, not network reliability. ZMQ with binary serialization (msgspec or dataclass) minimizes this cost. gRPC is better suited for cross-machine communication where its typed schemas and retry semantics add value.
Q2 (Concept): SGLang uses a 3-process model while vLLM uses 2 processes. What's the specific benefit of separating the detokenizer into its own process?
Detokenization is CPU-bound and involves string manipulation (BPE decoding, Unicode handling, special token filtering). In a 2-process model, detokenization either happens in the frontend (competing with HTTP handling and tokenization for CPU) or in the engine core (competing with scheduling). By isolating it, SGLang ensures that: (1) the scheduler process is never blocked by string operations, keeping GPU scheduling latency deterministic, and (2) the HTTP process can focus on I/O multiplexing. The cost is one extra ZMQ hop per token, which is ~1µs — negligible compared to the ~10ms per-token generation time.
Q3 (Calculation): An inference server receives 500 requests/second. Average prompt length is 2000 tokens. Tokenization throughput of a single CPU core using HuggingFace tokenizer is ~500K tokens/second. How many tokenizer worker processes does SGLang need to avoid tokenization becoming the bottleneck?
Required tokenization throughput = 500 req/s * 2000 tokens = 1,000,000 tokens/s. At ~500K tokens/s per core, 2 workers is the bare minimum to keep up. In practice, you'd want headroom for variance (some prompts are much longer), so 3 workers would be prudent. SGLang's --tokenizer-worker-num=3 would handle this.
Q4 (Concept): vLLM's OutputProcessor uses "delta-merging" when the GPU produces tokens faster than the client consumes them. What would happen without this mechanism?
Without delta-merging, each generated token would be queued as a separate SSE event. If the GPU generates at 100 tokens/s but the client reads at 50 events/s, the queue grows unboundedly — consuming memory and increasing tail latency. Delta-merging collapses multiple pending tokens into a single event (e.g., sending "the quick brown" instead of three separate events for "the", " quick", " brown"). This bounds queue depth and reduces per-event overhead (HTTP framing, JSON serialization) at the cost of slightly burstier delivery — which is invisible to users.
2. Router / Admission Control
First Principles
In data-parallel (DP) serving, multiple engine replicas share the incoming request stream. A naive round-robin dispatcher treats every request equally, but LLM requests are not equal: a 32-token chat completion costs ~1000x fewer FLOPs than a 32,000-token document summarization. Without token-aware routing, one replica can be saturated with a single long-context request while another sits idle, causing head-of-line (HoL) blocking. The router must solve two problems: (1) distribute load proportional to actual cost, and (2) prevent starvation of queued requests when a high-priority request arrives.
Neither vLLM nor SGLang implements tenant isolation, per-user rate limiting, or request-level SLO routing at this layer. Both assume a cooperative, single-tenant deployment.
Design Space
| Decision | Option A | Option B | Implication |
|---|---|---|---|
| Load signal | Queue length | Token count | Token count captures cost heterogeneity |
| Routing granularity | Per-request | Per-wave (batch) | Per-wave enables cooperative DP scheduling |
| State sharing | Pull (engine reports) | Push (coordinator broadcasts) | Push is lower latency but more traffic |
| Preemption at router | None (defer to scheduler) | Router-level priority | Router-level is simpler but coarser |
vLLM
vLLM uses DPCoordinator (vllm/v1/engine/coordinator.py:23), a dedicated process that intermediates between front-end API servers and DP engine ranks.
Wave-based cooperative DP. Engines alternate between running and paused states. A global "request wave" counter tracks transitions from running to paused, synchronized via all-reduce in DPEngineCoreProc._has_global_unfinished_reqs. When all engines drain their queues, they collectively pause. When any engine receives a new request, the coordinator broadcasts START_DP_WAVE to resume all engines. This ensures all DP ranks participate in every batch, which is required for operations that involve cross-rank communication (e.g., tensor-parallel all-reduce within each DP group).
Load balancing is front-end side. The coordinator publishes per-engine stats (waiting queue length, running queue length) via ZMQ PUB to all front-ends (coordinator.py:148, EngineState.request_counts). Front-ends make local LB decisions using these stats. In "External LB mode," no stats are published and an external load balancer handles routing.
No priority preemption at the router. Priority is handled at the scheduler level via SchedulingPolicy.PRIORITY (request_queue.py:13), which uses a heap-based RequestQueue.
SGLang
SGLang uses DataParallelController (sglang/srt/managers/data_parallel_controller.py:116) with pluggable load-balancing methods defined in the LoadBalanceMethod enum (line 70), including:
TOTAL_TOKENS -- Routes to the rank with the fewest total tokens (prompt + generated). Uses DPBudget.dispatch() (line 99) with token count as primary key and request count as tiebreaker.
TOTAL_REQUESTS -- Routes to the rank with the fewest total requests.
FOLLOW_BOOTSTRAP_ROOM -- Used in disaggregated mode; routes to the rank with the most available bootstrap room for KV transfer.
DPBudget (line 87) maintains per-rank counters updated via WatchLoadUpdateReq messages from schedulers. After dispatch, it optimistically increments the target rank's request count by 1 to avoid burst routing to the same rank before the next stats update.
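The dispatch rule can be sketched in a few lines. This is a standalone simplification of DPBudget.dispatch, using plain lists for the per-rank counters:

```python
# TOTAL_TOKENS dispatch: argmin by token count, request count as tiebreaker,
# then optimistically bump the winner's request count until the next stats update.
def dispatch_total_tokens(total_tokens: list[int], total_requests: list[int]) -> int:
    rank = min(range(len(total_tokens)),
               key=lambda r: (total_tokens[r], total_requests[r]))
    total_requests[rank] += 1   # optimistic update to avoid burst routing
    return rank

tokens, reqs = [12000, 8000, 15000, 8000], [5, 3, 6, 4]
assert dispatch_total_tokens(tokens, reqs) == 1   # rank 1 wins the 8000-token tie
assert reqs[1] == 4                               # optimistic increment applied
```

Without the optimistic increment, several requests arriving between two stats updates would all land on the same "least loaded" rank.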
Priority preemption. SGLang supports priority-based preemption at the scheduler level via PrefillAdder.preempt_to_schedule() (schedule_policy.py:855). When a high-priority request arrives and insufficient budget exists, the scheduler iterates running requests sorted by (priority * -priority_sign, -wait_queue_entry_time) and preempts them. The configurable threshold priority_scheduling_preemption_threshold gates this behavior.
Comparison
| Aspect | vLLM | SGLang |
|---|---|---|
| LB methods | External or stats-based (front-end) | Built-in token/request-aware methods |
| Coordination | Wave-based (cooperative DP) | Independent dispatch |
| Stats transport | ZMQ PUB from coordinator | ZMQ from schedulers to controller |
| Priority preemption | Scheduler-level (heap queue) | Scheduler-level (sorted scan + preempt) |
| Disaggregated routing | Not at router level | FOLLOW_BOOTSTRAP_ROOM |
vLLM's wave coordination wins when cross-rank synchronization (TP within DP) is required. SGLang's token-aware dispatch wins for heterogeneous workloads where request costs vary by orders of magnitude.
Quiz: Test Your Understanding
Q1 (Concept). Why does round-robin load balancing fail for LLM serving, even when all replicas have identical hardware?
LLM requests have heterogeneous cost: a 100-token request occupies the GPU for ~1ms while a 100,000-token request occupies it for seconds. Round-robin distributes request *count* evenly but not request *cost*. This causes one replica to be saturated while another is idle, increasing tail latency by the duration of the longest request in the unlucky replica's queue.
Q2 (Calculation). You have DP=4. Current token counts per rank: [12000, 8000, 15000, 8000]. Current request counts: [5, 3, 6, 4]. A new request with 4000 prompt tokens arrives. Under SGLang's TOTAL_TOKENS policy, which rank receives it?
TOTAL_TOKENS picks argmin(total_tokens) with total_requests as tiebreaker. Ranks 1 and 3 both have 8000 tokens. Tiebreaker: rank 1 has 3 requests, rank 3 has 4. Rank 1 wins. After dispatch, DPBudget increments rank 1's request count to 4.
Q3 (Concept). Explain why vLLM's wave-based coordination is necessary when DP is combined with tensor parallelism.
In DP+TP, each DP group runs tensor-parallel all-reduce across its TP ranks during every forward pass. If one DP group has work and another is idle, the idle group's TP ranks will not participate in collective communication, causing a deadlock or requiring expensive barrier synchronization. Wave coordination ensures all DP groups run forward passes in lockstep, so collectives proceed uniformly.
3. Global Scheduler
First Principles
[Figure: static batching (wasted slots shown in gray) vs. continuous batching (no wasted slots) under an interactive workload]
The scheduler is the brain of the serving engine. It decides which requests get GPU time at each iteration and how much. Three fundamental insights drive its design.
Insight 1: Static batching wastes GPU. In static batching, you pad all sequences to max_len and run them as a fixed batch. GPU utilization = sum(actual_tokens) / (B * max_len). For a batch of 32 requests where the shortest finishes at 50 tokens and the longest at 2048, utilization can drop below 10%. Continuous batching (Orca, Yu et al. 2022) removes this waste by operating at *iteration granularity*: after each forward pass, finished requests are evicted and new requests are admitted, so the GPU always processes actual tokens.
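The utilization formula is easy to check numerically; the batch below matches the 50-vs-2048 example above:

```python
# Static batching pads every request to the longest generation length in the batch.
def static_batch_utilization(lengths: list[int]) -> float:
    return sum(lengths) / (len(lengths) * max(lengths))

# 32 requests: one runs to 2048 tokens, the rest finish at 50.
lengths = [50] * 31 + [2048]
util = static_batch_utilization(lengths)
print(f"{util:.1%}")  # → 5.5%
```

Over 94% of the batch slots are padding, which is exactly the waste continuous batching reclaims by evicting finished requests every iteration.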
Insight 2: Prefill and decode have opposite resource profiles. For a transformer layer with hidden dim h, number of heads n_h, head dim d, and intermediate dim d_ff:
Prefill (processing P input tokens): FLOPs per layer ~ 2P * params_per_layer. For Llama-70B with GQA (64 query heads, 8 KV head groups, head_dim=128): attention linear params = Q [8192,8192] + K [8192,1024] + V [8192,1024] + O [8192,8192] = 67M + 8.4M + 8.4M + 67M ~ 151M; FFN params (SwiGLU gate+up+down) = 3 * 8192 * 28672 ~ 704M; total ~ 855M params/layer. FLOPs/layer ~ 2P * 855M. Memory traffic ~ 2 * 855M bytes (load weights once, FP16). Arithmetic intensity ~ P FLOPs/byte. For P=4096, this is massively compute-bound.
Decode (generating 1 token per request in batch B): FLOPs per layer ~ 2B * params_per_layer. Memory traffic is still ~ 2 * params_per_layer bytes (weights dominate). Arithmetic intensity ~ B FLOPs/byte. For B=1, this is catastrophically memory-bandwidth-bound.
Concrete example: Llama-70B with GQA (h=8192, d_ff=28672, 64 query heads, 8 KV head groups, head_dim=128, 80 layers, FP16). Attention linear params per layer: Q+O = 2 * 8192^2 = 134M; K+V = 2 * 8192 * 1024 = 16.8M; total attention ~ 151M. FFN params per layer: 3 * 8192 * 28672 = 704M. Total per layer: ~855M params, weight bytes = 2 * 855M = 1.71 GB. Total weight memory: 80 * 1.71 = 137 GB ~ the commonly cited ~140 GB for 70B FP16. H100 SXM: 989 TFLOPS FP16, 3.35 TB/s HBM bandwidth. Balance point: 989e12 / 3.35e12 ~ 295 FLOPs/byte. Decode at B=1 has intensity ~1, so it is 295x underutilizing compute. Prefill at P=4096 has intensity ~4096, so it is ~14x oversubscribing compute relative to bandwidth (compute-bound).
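The parameter and roofline bookkeeping above can be verified in a few lines (same GQA shape: 64 query heads, 8 KV head groups, head_dim=128):

```python
h, d_ff, kv_dim, n_layers = 8192, 28672, 8 * 128, 80

attn_params = 2 * h * h + 2 * h * kv_dim            # Q+O and K+V projections
ffn_params = 3 * h * d_ff                            # SwiGLU gate/up/down
params_per_layer = attn_params + ffn_params
weight_gb = n_layers * params_per_layer * 2 / 1e9    # FP16: 2 bytes per param
balance = 989e12 / 3.35e12                           # H100 FLOPs per HBM byte

assert round(params_per_layer / 1e6) == 856          # the ~855M/layer figure
assert round(weight_gb) == 137                       # ~137 GB of weights
assert round(balance) == 295                         # roofline balance point
```

Any workload whose arithmetic intensity falls below the ~295 FLOPs/byte balance point leaves compute idle; anything above it leaves bandwidth idle.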
Insight 3: Mixed batching creates interference. A long prefill (e.g., 8192 tokens) in the same batch as decode requests adds seconds of latency to the decode tokens (which would otherwise complete in ~40ms). The scheduler must manage this interference.
Academic context. The continuous batching paradigm was introduced by Orca (Yu et al., OSDI 2022). Sarathi (Agrawal et al., 2023) introduced chunked prefill to mitigate prefill-decode interference. DistServe (Zhong et al., OSDI 2024) and Splitwise (Patel et al., ISCA 2024) formalized disaggregated prefill/decode architectures.
Design Space
| Decision | Option A | Option B | Implication |
|---|---|---|---|
| Batch model | Unified (no phases) | Explicit EXTEND/DECODE | Unified is simpler; explicit enables per-phase optimization |
| Chunked prefill | Fixed chunk size | Page-aligned chunks | Page alignment avoids fragmentation |
| Budget tracking | Single token budget | Multi-budget (input, chunk, total) | Multi-budget prevents long prefills from starving decode |
At each iteration, the scheduler: (1) admits new requests (or continues chunked prefills) up to the budget, then (2) assembles the batch and dispatches it to the GPU.
The key invariant is that the total tokens processed in one forward pass never exceeds max_batch_tokens (the GPU's capacity for a single iteration).
Chunked Prefill
Without chunked prefill, a 32,000-token prompt monopolizes the GPU for the entire prefill duration, blocking all decode requests. Chunked prefill splits the prompt into chunks (e.g., 4096 tokens each) and interleaves them with decode iterations. The trade-off is chunk size vs. decode latency:
Large chunks: Better prefill throughput (fewer kernel launches, better GPU utilization per chunk) but higher decode latency spikes.
Small chunks: Lower decode latency but more overhead and potentially lower prefill throughput.
Overlap Scheduling (SGLang)
SGLang's event_loop_overlap (scheduler.py:1381-1433) pipelines CPU scheduling with GPU execution. The sequence per iteration is:
1. Receive requests from tokenizer (CPU).
2. Get next batch -- run the scheduling algorithm (CPU).
3. Launch current batch on GPU (async kernel launch).
4. Process previous batch results -- update KV cache metadata, check stop conditions, send outputs (CPU) -- *while the GPU is running the current batch*.
5. Launch sampling for the current batch if needed (depends on previous batch's grammar state).
The critical overlap: step 4 (CPU-heavy) runs concurrently with step 3's GPU execution. For workloads where CPU scheduling time is non-negligible (e.g., large radix tree operations, grammar checking), this reclaims CPU time that would otherwise be on the critical path. The is_disable_overlap_for_batch method (line 1435) disables overlap for consecutive prefill batches to optimize TTFT.
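The overlap structure can be sketched with a worker thread standing in for the GPU stream. Timings are simulated and function names are ours; only the pipelining shape matters:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_forward(batch_id: int) -> str:
    time.sleep(0.01)                       # stands in for the forward pass
    return f"result-{batch_id}"

def process_result(result: str, log: list) -> None:
    log.append(result)                     # stop checks, detok dispatch, etc.

def overlap_loop(n_batches: int) -> list:
    log, prev = [], None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i in range(n_batches):
            future = pool.submit(gpu_forward, i)   # async "kernel launch"
            if prev is not None:
                process_result(prev, log)          # CPU work overlaps "GPU"
            prev = future.result()                 # sync before next schedule
        process_result(prev, log)                  # drain the last batch
    return log

assert overlap_loop(3) == ["result-0", "result-1", "result-2"]
```

While `gpu_forward(i)` runs in the worker thread, the main thread post-processes batch i-1; results still come out in order because each iteration syncs before scheduling the next.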
Adaptive Token Ratio (SGLang)
SGLang's new_token_ratio (schedule_policy.py:649-654) controls how much memory to reserve for future decode tokens. When admitting a prefill, the scheduler must reserve max_new_tokens * new_token_ratio KV slots for that request's future generated tokens. A ratio of 1.0 is maximally conservative (reserve for full generation length), while lower values allow more concurrent requests but risk OOM if many requests generate long outputs.
The _get_running_request_total_token_offset method (line 447) computes each running request's reservation: min(max_new_tokens - len(output_ids), CLIP_MAX_NEW_TOKENS) * new_token_ratio. This is accumulated into rem_total_token_offset, which is subtracted from available KV pool capacity to compute rem_total_tokens.
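The reservation formula as a standalone sketch (the CLIP_MAX_NEW_TOKENS default below is a placeholder, not SGLang's actual value):

```python
def decode_reservation(max_new_tokens: int, num_generated: int,
                       new_token_ratio: float,
                       clip_max_new_tokens: int = 4096) -> int:
    # min(remaining generation budget, clip), scaled by the ratio
    tokens_left = min(max_new_tokens - num_generated, clip_max_new_tokens)
    return round(tokens_left * new_token_ratio)

# 2000 max_new_tokens, 500 already generated, ratio 0.7 (Quiz Q3's numbers)
assert decode_reservation(2000, 500, 0.7) == 1050
```

Summing this over all running requests gives rem_total_token_offset, the slack subtracted from the KV pool before new prefills are admitted.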
"There's no 'decoding phase' nor 'prefill phase' in the scheduler. Each request just has the num_computed_tokens and num_tokens_with_spec. At each step, the scheduler tries to assign tokens to the requests so that each request's num_computed_tokens can catch up its num_tokens_with_spec."
Scheduling loop. The schedule() method (line 348) processes in order:
Running requests (line 383): iterate self.running, compute num_new_tokens = num_tokens_with_spec + num_output_placeholders - num_computed_tokens, clamp to token_budget and long_prefill_token_threshold for chunking.
Waiting requests (from RequestQueue): pop from a deque (FCFS) or heapq (Priority) via request_queue.py:13-18.
Preemption: If KV allocation fails for a running request, preempt via num_computed_tokens = 0 (full recompute). The preempted request is re-queued.
Key classes:
Scheduler (scheduler.py:67) -- main scheduler.
RequestQueue (request_queue.py:20) -- abstract base; FcfsRequestQueue (deque) and PriorityRequestQueue (heapq).
SchedulingPolicy (request_queue.py:13) -- enum: FCFS or PRIORITY.
SGLang's Approach: Explicit EXTEND/DECODE with Multi-Budget
SGLang uses a mixin-based scheduler (sglang/srt/managers/scheduler.py) with explicit EXTEND (prefill) and DECODE modes.
PrefillAdder (schedule_policy.py:375) is the core budget manager with three simultaneous constraints:
rem_input_tokens (line 383): max new input tokens this iteration (controls prefill admission).
rem_chunk_tokens (line 384): max tokens for a single chunked prefill (controls chunk size).
rem_total_tokens (property, line 457): available KV pool capacity minus reservations for future decode tokens.
The add_one_req method (line ~600) checks all three budgets before admitting a request. If the request fits within rem_chunk_tokens, it is admitted as a full prefill. Otherwise, it is truncated to rem_chunk_tokens // page_size * page_size (page-aligned chunking, line 827).
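The admission/truncation rule as a standalone sketch (helper name is ours):

```python
# Page-aligned chunk admission: a prompt that fits the chunk budget is taken
# whole; otherwise it is truncated down to a page boundary and the remainder
# is deferred to later iterations.
def admit_prefill(prompt_len: int, rem_chunk_tokens: int, page_size: int):
    if prompt_len <= rem_chunk_tokens:
        return prompt_len, 0                       # full prefill this iteration
    trunc = rem_chunk_tokens // page_size * page_size
    return trunc, prompt_len - trunc               # chunked: remainder deferred

# A 6000-token prompt with rem_chunk_tokens=4096 and page_size=16 (Quiz Q2)
assert admit_prefill(6000, 4096, 16) == (4096, 1904)
```

Flooring to a page boundary means a chunk never ends mid-page, so the KV allocator hands out only fully used pages for the chunk.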
Preemption with partial reuse. When preempting, SGLang's radix tree retains the computed KV cache prefix. The preempted request can later resume from the cached prefix rather than recomputing from scratch. This is a significant advantage over vLLM's full-recompute strategy for long-context requests.
Comparison
| Aspect | vLLM | SGLang |
|---|---|---|
| Phase model | Unified (num_computed catching up) | Explicit EXTEND/DECODE |
| Budget tracking | Single token_budget | Triple: input, chunk, total |
| Chunked prefill | long_prefill_token_threshold | Page-aligned rem_chunk_tokens |
| Preemption cost | Full recompute (num_computed_tokens=0) | Partial reuse via radix tree |
| CPU-GPU overlap | Not in scheduler loop | event_loop_overlap |
| Decode reservation | Implicit (KV blocks allocated on demand) | Explicit new_token_ratio reservation |
| Scheduling policy | FCFS or Priority (heapq) | FCFS with priority preemption |
vLLM's unified model is conceptually cleaner and naturally handles speculative decoding (spec tokens are just more tokens to "catch up"). SGLang's explicit phases enable more aggressive optimization: page-aligned chunking, triple-budget control, and overlap scheduling.
Quiz: Test Your Understanding
Q1 (Concept). Explain why continuous batching improves GPU utilization over static batching. What is the utilization formula for static batching with B requests where request i generates L_i tokens and the maximum is L_max?
Static batching pads all requests to L_max. GPU utilization = sum(L_i) / (B * L_max). If L_i varies widely (e.g., 50 to 2048), utilization can be sum(L_i)/(B*2048), which can be <10%. Continuous batching evicts finished requests and admits new ones at each step, so every token processed is an actual token, giving near-100% utilization of the batch dimension.
Q2 (Calculation). Given max_batch_tokens = 8192, 10 running decode requests (each consuming 1 token per step), and a new 6000-token prefill request in the queue. Assume rem_chunk_tokens = 4096 and page_size = 16. Under SGLang's PrefillAdder: (a) How many tokens are available for prefill input? (b) How is the 6000-token request handled?
(a) rem_input_tokens = max_batch_tokens - decode_tokens = 8192 - 10 = 8182. But rem_chunk_tokens = 4096. (b) Since 6000 > rem_chunk_tokens (4096), the request is chunked. trunc_len = 4096 // 16 * 16 = 4096 (already page-aligned). So the first 4096 tokens are prefilled this iteration. The remaining 1904 tokens are deferred to the next iteration(s). After this admission, rem_input_tokens decreases by 4096, rem_chunk_tokens drops to 0, and the request is marked as a new_chunked_req.
Q3 (Calculation). A request has max_new_tokens = 2000 and has already generated 500 tokens. new_token_ratio = 0.7. What is the token reservation for this request's future decode tokens?
tokens_left = min(max_new_tokens - len(output_ids), CLIP_MAX_NEW_TOKENS) * new_token_ratio = min(2000 - 500, CLIP_MAX_NEW_TOKENS) * 0.7. Assuming CLIP_MAX_NEW_TOKENS >= 1500: reservation = 1500 * 0.7 = 1050 tokens. This 1050 is added to rem_total_token_offset, reducing the available pool for new admissions.
Q4 (Concept). Why does vLLM's full-recompute preemption strategy become expensive for long-context requests, and how does SGLang's radix tree mitigate this?
vLLM sets num_computed_tokens = 0 on preemption, requiring the entire prompt to be re-prefilled. For a 32K-token prompt, this costs ~2 * 32768 * 855M * 80 FLOPs (where 855M is the per-layer parameter count for Llama-70B with GQA), which can take seconds. SGLang's radix tree retains the KV cache for the prefix that was already computed. When the request is rescheduled, it resumes from the cached prefix (e.g., if 28K of 32K tokens were cached, only the remaining 4K need re-prefilling). The savings are proportional to the cached prefix fraction.
Q5 (Concept). What CPU operations does SGLang's overlap scheduling hide, and why is overlap disabled for consecutive prefill batches?
Overlap hides: processing previous batch results (updating KV metadata, checking stop conditions, radix tree insertion, sending outputs to clients, grammar state updates). These are CPU-bound operations that run while the GPU executes the current batch. Overlap is disabled for consecutive prefills because the second prefill's TTFT would be delayed by the first prefill's GPU time plus the CPU processing of the first prefill's results. By processing the first prefill's results immediately (without overlap), the second prefill can start sooner, improving TTFT for the first request.
4. Prefill Tier
First Principles
Prefill is the phase where the model processes the full input prompt to populate the KV cache. It is compute-bound because every input token requires a full forward pass through all layers, and the arithmetic intensity scales with the sequence length.
FLOPs derivation for Llama-70B at 4096 input tokens. Per transformer layer, the dominant operations are the linear projections. With h=8192, d_ff=28672, GQA (64 query heads, 8 KV head groups, head_dim=128):
Attention linear projections: Q [8192,8192] + K [8192,1024] + V [8192,1024] + O [8192,8192] = 151M params. FLOPs = 2 * 4096 * 151M = 1.24T FLOPs
FFN (gate + up + down, with SwiGLU): 2 * 4096 * 3 * 8192 * 28672 = 5.8T FLOPs (3 projections for SwiGLU)
Attention (Q-K dot product + softmax-V): 4 * n_query_heads * seq_len^2 * head_dim = 4 * 64 * 4096^2 * 128 = 0.55T FLOPs (2 * seq_len^2 * head_dim per head for QK^T, and the same again for softmax-V). Note: GQA reduces KV *memory* (8 KV head groups are shared across 64 query heads) but does NOT reduce query-side attention compute -- all 64 query heads still compute their own Q-K dot products and softmax-V products.
Per layer total: 1.24T (linear) + 5.8T (FFN) + 0.55T (attention) = ~7.6T FLOPs. Across 80 layers: ~605T FLOPs.
H100 SXM delivers 989 TFLOPS FP16 (sustained ~750 TFLOPS). Time: 605T / 750T ~ 0.8 seconds. This is firmly compute-bound: the weight data loaded is ~137 GB, which at 3.35 TB/s takes ~41ms. The compute takes ~20x longer than the memory transfer.
Prefix sharing economics. In production, many requests share a system prompt (e.g., 2000 tokens). Without prefix sharing, N requests each pay the full 2000-token prefill cost. With prefix caching, the system prompt is computed once and the KV cache is reused, saving (N-1) * 2000 * cost_per_token FLOPs. For N=100, this is a 99% reduction in redundant compute for the shared prefix.
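The savings arithmetic as code (helper name is ours):

```python
# N requests sharing a P-token system prompt pay for one prefill of the
# prefix instead of N.
def shared_prefix_savings(n_requests: int, prefix_tokens: int) -> float:
    without_cache = n_requests * prefix_tokens   # token-prefills without caching
    with_cache = prefix_tokens                   # computed once, reused N-1 times
    return 1 - with_cache / without_cache

# N=100 requests sharing a 2000-token system prompt
assert abs(shared_prefix_savings(100, 2000) - 0.99) < 1e-9
```

The savings approach 1 - 1/N regardless of prefix length, which is why prefix caching pays off most in high-fan-out deployments (agents, RAG, shared system prompts).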
Disaggregated prefill. When the prefill workload is large enough, it can be beneficial to dedicate separate GPU instances to prefill vs. decode. The key trade-off: the KV cache must be transferred from prefill to decode instances. This is economical when transfer_time < scheduling_benefit. The scheduling benefit comes from eliminating prefill/decode interference entirely.
vLLM
vLLM does not have a separate prefill tier by default. All requests go through the unified scheduler. However, it provides a KV connector framework for disaggregated serving:
KVConnectorBase_V1 (vllm/distributed/kv_transfer/kv_connector/v1/base.py) defines the scheduler-side and worker-side primitives.
KVConnectorRole enum: SCHEDULER (metadata binding) and WORKER (actual KV transfer).
The connector provides start_load_kv(), wait_for_layer_load(), save_kv_layer(), and wait_for_save() for layer-pipelined KV transfer.
KVCacheEvent and KVEventBatch (kv_events.py) handle event notification.
SGLang
SGLang provides first-class disaggregated prefill via --disaggregation-mode prefill|decode.
Bootstrap handshake. PrefillBootstrapQueue (sglang/srt/disaggregation/prefill.py:87) manages the handshake between prefill and decode instances. It initializes a CommonKVManager with KV data pointers, page size, and transfer backend configuration. The prefill instance computes the KV cache and then transfers it to the decode instance.
KV events. EventBatch and KVCacheEvent (sglang/srt/disaggregation/kv_events.py:38-56) are msgspec-based structs for efficient serialization. Events are published via ZMQ PUB when blocks are stored, notifying decode instances that KV data is ready.
DecodeReqToTokenPool pre-allocation. The decode instance pre-allocates token pool slots before the KV transfer begins, avoiding allocation contention during transfer.
Transfer backends. SGLang supports multiple transfer backends including NCCL, NIXL (NVIDIA's Inference Xfer Library), and custom RDMA implementations.
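A back-of-envelope estimator for KV size and transfer time, using the Llama-70B shape from this guide (helper names are ours; link rates are nominal peak):

```python
def kv_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for the K and V tensors at every layer (FP16 = 2 bytes)
    return 2 * layers * tokens * kv_heads * head_dim * dtype_bytes

def transfer_ms(nbytes: int, gb_per_s: float) -> float:
    return nbytes / (gb_per_s * 1e9) * 1e3

size = kv_bytes(8192)
assert size == 2.5 * 2**30                   # 2.5 GiB for an 8192-token request
assert 2.5 < transfer_ms(size, 900) < 3.5    # ~3 ms over NVLink 4.0
assert 50 < transfer_ms(size, 50) < 55       # ~54 ms over one NDR IB port
```

Comparing these times against the prefill compute time (hundreds of ms to seconds for long prompts) tells you whether disaggregation can pay for itself on a given interconnect.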
Comparison
| Aspect | vLLM | SGLang |
|---|---|---|
| Disaggregation support | KV connector framework (pluggable) | First-class --disaggregation-mode flag |
| KV transfer abstraction | Layer-pipelined (load/save per layer) | Block-level with bootstrap handshake |
| Event notification | KVEventBatch | EventBatch via ZMQ PUB |
| Prefix caching | Block-level hash matching | Radix tree (token-level granularity) |
| Transfer backends | Pluggable connectors | NCCL, NIXL, custom RDMA |
Quiz: Test Your Understanding
Q1 (Calculation). Calculate the KV cache size for a single request with 8192 tokens on Llama-70B (80 layers, 8 KV heads per layer with GQA, head_dim=128, FP16). How long does transfer take on (a) NVLink 4.0 (900 GB/s), (b) InfiniBand NDR (400 Gb/s = 50 GB/s per port), (c) PCIe 5.0 x16 (64 GB/s)?
Per token per layer: 2 (K+V) * 8 heads * 128 * 2 bytes = 4 KB. Per layer for 8192 tokens: 8192 * 4 KB = 32 MB. Total across 80 layers: 80 * 32 MB = 2560 MB = 2.5 GB.
Transfer times:
(a) NVLink 4.0: 2.5 GB / 900 GB/s = 2.8 ms
(b) InfiniBand NDR: 2.5 GB / 50 GB/s = 50 ms
(c) PCIe 5.0 x16: 2.5 GB / 64 GB/s = 39 ms
For disaggregated prefill to be worthwhile, the scheduling benefit (eliminating prefill/decode interference) must exceed the transfer time. At 50ms over InfiniBand NDR, this is only justified for very long prompts or high-QPS scenarios.
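The Q1 arithmetic can be scripted; using exact binary byte counts gives figures a few percent above the hand-rounded decimal-GB numbers:

```python
# KV size for 8192 tokens on Llama-70B (80 layers, 8 KV heads, head_dim=128,
# FP16), then transfer time across the three links from Q1.

TOKENS = 8192
PER_TOKEN_BYTES = 2 * 8 * 128 * 2 * 80   # (K+V) x heads x head_dim x fp16 x layers
kv_bytes = TOKENS * PER_TOKEN_BYTES      # 2.5 GiB

LINK_GBPS = {"NVLink 4.0": 900, "InfiniBand NDR": 50, "PCIe 5.0 x16": 64}
transfer_ms = {name: kv_bytes / (g * 1e9) * 1e3 for name, g in LINK_GBPS.items()}
for name, ms in transfer_ms.items():
    print(f"{name:>14}: {ms:5.1f} ms")
```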
Q2 (Concept). Under what conditions does disaggregated prefill improve overall throughput compared to colocated prefill+decode?
Disaggregated prefill helps when: (1) prefill requests are long enough that their interference with decode causes significant TPOT (time per output token) degradation, (2) the KV transfer time is small relative to the prefill compute time, and (3) the prefill and decode load can be balanced across dedicated instances. It hurts when prefill requests are short (transfer overhead > interference cost) or when load is bursty and one tier sits idle.
Q3 (Concept). Explain why radix-tree prefix caching (SGLang) can share prefixes at finer granularity than block-level hash matching (vLLM).
Block-level hashing computes a hash over each fixed-size block of tokens. Two requests share a prefix only if their token sequences match at block boundaries. If one request has a 2001-token prefix and another has a 2003-token prefix (with block_size=16), the last partial block differs and cannot be shared. A radix tree stores the token sequence as a trie, allowing sharing at arbitrary token boundaries. The 2001-token and 2003-token prefixes share the first 2001 tokens exactly, with no wasted recomputation.
5.
Decode Tier
First Principles
Decode is the autoregressive phase: one new token per request per step. It is memory-bandwidth-bound because each token requires loading the full model weights but performs minimal computation per loaded byte.
Arithmetic intensity derivation for Llama-70B at batch=1. Each parameter performs 2 FLOPs (multiply-add) per token and costs 2 bytes to load (FP16), so arithmetic intensity at batch B = (2*B) / 2 = B FLOPs/byte. At B=1: intensity = 1 FLOP/byte. The H100 balance point is ~295 FLOPs/byte (989 TFLOPS / 3.35 TB/s). Decode at B=1 sits 295x below the compute saturation point.
The batch-size lever. Increasing batch size B scales FLOPs linearly (each request adds one token) without proportionally increasing memory traffic (weights are loaded once per step regardless of B). Intensity at batch B = B FLOPs/byte. Compute saturation: B = 295. Above B=295, decode becomes compute-bound on H100.
In practice, KV cache memory limits B long before compute saturation. For Llama-70B on a single H100 (80GB), after weights (140GB -- requires at least 2 GPUs), KV cache per token per layer is 2 * 8 * 128 * 2 = 4 KB, total per token 80 * 4 KB = 320 KB. With 40GB for KV cache (rough, on a 2-GPU setup), max tokens = 40GB / 320KB ~ 125K tokens. At average context length 2048, max B ~ 60 -- well below the compute saturation point of 295.
KV cache bandwidth overhead. The analysis above considers only weight-loading bandwidth. At longer contexts, KV cache reads add significant pressure. For seq_len=2048, 8 KV heads, head_dim=128, FP16: KV read per token per layer = 2 (K+V) * 2048 * 8 * 128 * 2 bytes = 8 MB. Across 80 layers per step: 640 MB per request. At B=64: 640 MB * 64 = ~40 GB of KV reads per step, on top of ~137 GB of weight reads -- roughly a 30% bandwidth overhead. And because KV reads grow with B while weight reads do not, arithmetic intensity no longer scales linearly with batch size: the effective balance point shifts upward, and decode stays bandwidth-bound out to larger batch sizes than the weight-only estimate of B=295 suggests.
Max tokens/second at B=1. Time per step = weight_load_time = 137 GB / 3.35 TB/s = 40.9 ms. Throughput = 1 / 0.0409 ~ 24 tokens/second. This is the theoretical maximum for single-request decode latency on H100.
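The weight-only roofline above fits in a few lines. This reproduces the section's numbers (137 GB of weights, 3.35 TB/s HBM, 989 TFLOPS) and deliberately ignores KV-cache reads, so it recovers the naive B=295 balance point:

```python
# Back-of-envelope decode roofline for Llama-70B FP16 on H100 (weight-only).

WEIGHT_BYTES = 137e9       # FP16 weights
HBM_BW = 3.35e12           # bytes/s
PEAK_FLOPS = 989e12        # FP16 tensor FLOPS

def step_time_s(batch: int) -> float:
    """Per-step time: max of weight-load time and compute time."""
    flops = 2 * (WEIGHT_BYTES / 2) * batch   # 2 FLOPs per parameter per token
    return max(WEIGHT_BYTES / HBM_BW, flops / PEAK_FLOPS)

def tokens_per_s(batch: int) -> float:
    return batch / step_time_s(batch)

balance_b = PEAK_FLOPS / HBM_BW   # ~295: batch size where intensity hits the roof
print(f"B=1:   {tokens_per_s(1):.0f} tok/s")
print(f"B=256: {tokens_per_s(256):.0f} tok/s")
print(f"compute-bound above B ~ {balance_b:.0f}")
```

Note how total throughput at B=256 is ~256x the B=1 figure: in the bandwidth-bound regime, batching is almost free.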
vLLM
vLLM uses the same unified scheduler for decode. There is no separate decode mode or tier.
Preemption = full recompute. When memory pressure forces preemption, vLLM resets num_computed_tokens = 0. The preempted request must re-prefill its entire prompt. For a 32K-token context, this costs the same as the original prefill (~0.75s on H100 for 70B).
Decode batching. Running requests are processed first in schedule() (line 383). Each running request contributes num_tokens_with_spec - num_computed_tokens tokens (typically 1 for standard decode, more with speculative decoding). The budget consumed by decode is minimal per request, leaving room for new prefill admissions.
SGLang
SGLang has an explicit DECODE forward mode that enables decode-specific optimizations.
Retraction (defer without freeing KV). When the scheduler needs to reduce batch size but doesn't want to lose computed KV cache, it can "retract" a request -- temporarily remove it from the active batch while keeping its KV cache in the radix tree. When re-admitted, the request resumes without any recomputation.
Preemption with partial reuse. When preemption is necessary (e.g., for a higher-priority request), the radix tree retains as much of the KV prefix as possible. The preempted request's KV cache is not immediately freed; it remains in the tree as evictable entries. If the request is rescheduled before eviction, it resumes from the cached state.
Decode-specific batching. In decode mode, SGLang can apply optimizations like CUDA graph capture for fixed batch sizes, avoiding kernel launch overhead for the small per-token computation.
Comparison
| Aspect | vLLM | SGLang |
| --- | --- | --- |
| Decode mode | Unified (no separate mode) | Explicit DECODE forward mode |
| Preemption cost | Full recompute | Partial reuse via radix tree |
| KV retention on preemption | Freed immediately | Retained as evictable in radix tree |
| Retraction (defer) | Not available | Supported |
| CUDA graph decode | Supported | Supported |
| Batch size control | Token budget based | Token budget + max_running_requests |
Quiz: Test Your Understanding
Q1 (Calculation). For Llama-70B FP16 on H100 SXM (3.35 TB/s HBM, 989 TFLOPS): (a) What is the maximum decode throughput at B=1 in tokens/second? (b) At what batch size does decode become compute-bound? (c) What is the maximum decode throughput at B=256?
(a) Time per step at B=1 = 137 GB / 3.35 TB/s = 40.9 ms. Throughput = 1/0.0409 = 24 tok/s per request (also 24 tok/s total since B=1).
(b) Balance point: each parameter does 2 FLOPs and costs 2 bytes (FP16), so intensity = B FLOPs/byte. Compute saturation at B = 295 (= 989T/3.35T). Above B=295, decode is compute-bound.
(c) At B=256 (still bandwidth-bound since 256 < 295): time per step is still ~40.9 ms (dominated by weight loading). Total throughput = 256 / 0.0409 = 6,260 tok/s. Per-request throughput = 6260/256 = 24 tok/s (same as B=1 -- this is the beauty of batching in the bandwidth-bound regime: total throughput scales linearly with B while per-request latency stays constant).
Q2 (Calculation). You have 40 GB of KV cache memory for Llama-70B (80 layers, 8 KV heads, head_dim=128, FP16). What is the maximum batch size at average context length 2048? At context length 512?
At context 2048: tokens per request = 2048. Memory per request = 2048 * 320 KB = 640 MB. Max B = 40 GB / 640 MB = 62.
At context 512: tokens per request = 512. Memory per request = 512 * 320 KB = 160 MB. Max B = 40 GB / 160 MB = 250.
Note that B=250 is still below the compute saturation point (295), so even at short context lengths, decode remains bandwidth-bound on H100 given typical KV cache constraints.
Q3 (Concept). Why is SGLang's retraction mechanism more efficient than vLLM's preemption for transient load spikes?
A transient load spike may require temporarily reducing batch size. vLLM's preemption resets num_computed_tokens to 0, requiring full re-prefill when the request is rescheduled -- potentially seconds of wasted compute. SGLang's retraction removes the request from the active batch but keeps its KV cache in the radix tree. When load subsides (often within a few iterations), the request resumes with zero recomputation. The only cost is the memory occupied by the retained KV cache during the retraction period. For spikes lasting < 1 second, retraction avoids the multi-second re-prefill penalty entirely.
6.
KV-Cache System
First Principles
[Interactive figure: paged-attention block table. Logical blocks R1:0-R1:2 and R2:0-R2:1 map through the block table to scattered physical blocks (3, 7, 1, 5, 2) in GPU HBM.]
The KV cache stores key and value projections from every attention layer so that previously computed tokens are not recomputed during autoregressive decode. The fundamental allocation problem is that you do not know a request's output length at arrival time. If you reserve max_seq_len tokens of contiguous memory for every request, the internal fragmentation is:
waste_per_request = (max_seq_len - actual_len) * KV_bytes_per_token
If max_seq_len = 8192 and a typical request uses 1024 tokens, that is 7168 * 320 KiB ≈ 2.2 GiB wasted per request. With 64 concurrent requests, waste exceeds 140 GiB -- more than any single GPU's HBM. Naive contiguous allocation is a non-starter at scale.
Paged Attention borrows the core idea from OS virtual memory: break the KV cache into fixed-size blocks (pages) and map logical token positions to physical blocks through a block table. Each block holds block_size tokens (typically 16). Per-block size: block_size * KV_bytes_per_token = 16 * 320 KiB = 5 MiB.
Allocation granularity drops from max_seq_len to block_size. Fragmentation is at most one partially-filled block per request -- under 5 MiB instead of 2+ GiB.
Capacity planning. For 80 GiB HBM, suppose model weights occupy 35 GiB (a 70B model at FP16 with TP=2 puts ~35 GiB of weights per GPU). Remaining KV budget: 45 GiB. Total blocks: 45 * 1024 / 5 = 9216 blocks = 147,456 tokens of KV capacity.
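The capacity-planning arithmetic above can be scripted directly from the model constants:

```python
# Paged-KV capacity planning for Llama-70B GQA (8 KV heads, head_dim=128,
# 80 layers, FP16, block_size=16), reproducing the section's numbers.

KV_HEADS, HEAD_DIM, LAYERS, BYTES = 8, 128, 80, 2
BLOCK_SIZE = 16

per_token_kib = 2 * KV_HEADS * HEAD_DIM * BYTES * LAYERS / 1024   # K and V
per_block_mib = per_token_kib * BLOCK_SIZE / 1024

hbm_gib, weights_gib = 80, 35
kv_budget_mib = (hbm_gib - weights_gib) * 1024
num_blocks = int(kv_budget_mib // per_block_mib)

print(f"{per_token_kib:.0f} KiB/token, {per_block_mib:.0f} MiB/block")
print(f"{num_blocks} blocks = {num_blocks * BLOCK_SIZE} tokens of KV capacity")
```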
Design Space
| Decision | Option A | Option B | Implication |
| --- | --- | --- | --- |
| Prefix dedup structure | Hash table (vLLM) | Radix tree (SGLang) | O(1) lookup vs O(k) traversal; tree supports mid-node splitting |
| Hashing granularity | Block-aligned only | Arbitrary prefix | Hash table cannot cache partial blocks; tree can split at any token |
| Eviction policy | Single LRU | Pluggable (LRU/LFU/SLRU/...) | More policies enable workload-specific tuning |
| Hash function | SHA256 / xxHash | Position-aware chained hash | Chained hash encodes position; content-only hash is position-agnostic |
| Block size | Larger (16-32) | Smaller (1-4) | Larger reduces table overhead but increases fragmentation |
vLLM
Core data structures. The KVCacheBlock dataclass (vllm/v1/core/kv_cache_utils.py:110-155) holds block_id, ref_cnt, and an optional _block_hash (a BlockHashWithGroupId). The doubly-linked pointers prev_free_block / next_free_block are embedded directly on the block to avoid separate node allocations.
FreeKVCacheBlockQueue (kv_cache_utils.py:158-366) is a doubly-linked list with sentinel head/tail nodes. All operations -- popleft, remove, append -- are O(1) with zero Python object allocation (pointer swizzling on existing attributes). The ordering implements LRU: when a block is freed, it is appended to the tail. Eviction pops from the head (least recently used). Blocks with ref_cnt > 0 are never in the free list, preventing eviction of shared prefix blocks.
Hash-based prefix dedup. BlockHash is a NewType over bytes (kv_cache_utils.py:37). The hash function is configurable: SHA256 via CBOR serialization (hashing.py:43-58) or xxHash-128 (hashing.py:76-79). The hash input is the token IDs within the block plus optional extra keys (multimodal hashes, LoRA names, cache salts -- need_extra_keys at line 369). A BlockHashToBlockMap maps BlockHashWithGroupId -> KVCacheBlock, enabling O(1) lookup of whether a given block of tokens already exists in the cache.
Limitation: block-boundary alignment. If a cached prefix is 30 tokens and block_size=16, only the first block (tokens 0-15) is fully cached and hashable. Tokens 16-29 form an incomplete second block with no hash entry. A new request sharing 25 tokens of that prefix can only reuse the first block -- the remaining 9 shared tokens are recomputed. This is the fundamental cost of block-granularity hashing.
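A toy model makes the boundary effect easy to see. This is a simplified stand-in for vLLM's block-hash map (plain tuples instead of real hashes), showing that only full blocks are shareable:

```python
# Toy block-aligned prefix cache: only *complete* blocks get a hashable key,
# so a partial trailing block can never be shared. Not vLLM's real code.

BLOCK_SIZE = 16

def full_blocks(tokens: list[int]) -> list[tuple[int, ...]]:
    """Hashable keys: one per complete block of tokens."""
    n = len(tokens) // BLOCK_SIZE
    return [tuple(tokens[i*BLOCK_SIZE:(i+1)*BLOCK_SIZE]) for i in range(n)]

def cached_prefix_len(cache: set, tokens: list[int]) -> int:
    """Longest block-aligned prefix already present in the cache."""
    hit = 0
    for key in full_blocks(tokens):
        if key not in cache:
            break
        hit += BLOCK_SIZE
    return hit

cache: set = set()
first = list(range(30))            # 30-token prompt: 1 full block + 14 leftover
cache.update(full_blocks(first))   # only tokens 0-15 become cacheable

second = list(range(25))           # shares 25 tokens with `first`
print(cached_prefix_len(cache, second))   # only the first block hits;
                                          # tokens 16-24 are recomputed
```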
SGLang
RadixCache (sglang/python/sglang/srt/mem_cache/radix_cache.py) uses a radix tree (trie) where each edge is labeled with a sequence of token IDs and each node stores KV cache indices.
TreeNode (line 121-147) contains:
children: dict -- regular dict with explicit child creation in _split_node (using defaultdict would create ghost nodes on key lookup)
parent: TreeNode -- back-pointer for lock-ref propagation
key: RadixKey -- the token ID sequence on the edge from parent to this node
value: Optional[torch.Tensor] -- KV cache slot indices for this node's tokens
lock_ref: int -- reference count; when > 0, node is protected from eviction
hash_value: Optional[List[str]] -- page-level hashes for disaggregated KV transfer
hit_count, last_access_time, creation_time, priority -- metadata for eviction
Important implementation detail: TreeNode.key is typed RadixKey, not raw token IDs. RadixKey supports an extra_key field for multimodal/LoRA discrimination -- two sequences with identical tokens but different LoRA adapters or multimodal inputs will NOT share cache. The child lookup uses a configurable get_child_key_fn (see radix_cache.py:184).
Prefix matching (_match_prefix_helper, line 667-691): walks down the tree matching token IDs. If a match ends mid-node (the incoming key shares only a prefix of an edge label), _split_node (line 693-713) breaks the edge into two: one node for the shared prefix, one for the remainder. This means SGLang can cache and share prefixes at arbitrary token boundaries, not just block boundaries.
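A minimal radix tree in the spirit of RadixCache (heavily simplified: token-ID keys only, no KV tensors, lock refs, or eviction) shows the mid-node split and arbitrary-boundary sharing:

```python
# Minimal radix-tree prefix cache with mid-node splitting. Illustrative
# sketch only -- SGLang's RadixCache adds KV indices, lock_refs, eviction.

class Node:
    def __init__(self, key):
        self.key = list(key)     # token IDs on the edge from parent
        self.children = {}       # first token of child edge -> Node

class RadixTree:
    def __init__(self):
        self.root = Node([])

    def _split(self, parent, child, k):
        """Break child's edge at offset k, inserting an intermediate node."""
        mid = Node(child.key[:k])
        child.key = child.key[k:]
        mid.children[child.key[0]] = child
        parent.children[mid.key[0]] = mid
        return mid

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            k = 0
            while (k < len(child.key) and i + k < len(tokens)
                   and child.key[k] == tokens[i + k]):
                k += 1
            if k < len(child.key):            # match ends mid-edge: split
                child = self._split(node, child, k)
            node, i = child, i + k

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                break
            k = 0
            while (k < len(child.key) and i + k < len(tokens)
                   and child.key[k] == tokens[i + k]):
                k += 1
            i += k
            if k < len(child.key):
                break
            node = child
        return i

tree = RadixTree()
tree.insert(list(range(2001)))               # 2001-token prefix
print(tree.match_prefix(list(range(2003))))  # shares all 2001 tokens
```

This is exactly the Q3 scenario from the block-hash side: with block_size=16, a hash table would share only 2000-floor-to-block tokens, while the tree shares all 2001.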
Fragmentation warning: _split_node (radix_cache.py:702) clones tensor values via .clone(), allocating new GPU memory for KV indices on every mid-node split. Under workloads with high prefix diversity (thousands of unique system prompts sharing partial prefixes), repeated splits create many small tensor allocations that fragment the KV pool. vLLM's block-aligned approach avoids this cost entirely by never splitting below block granularity.
Eviction strategies (evict_policy.py): seven pluggable strategies that return a priority tuple for heap-based eviction:
| Strategy | Priority Key | Use Case |
| --- | --- | --- |
| LRU (line 16) | last_access_time | General-purpose, default |
| LFU (line 22) | (hit_count, last_access_time) | Protect frequently reused system prompts |
| FIFO (line 27) | creation_time | Predictable, simple |
| MRU (line 32) | -last_access_time | Scan-resistant workloads |
| FILO (line 37) | -creation_time | Retain oldest (shared) prefixes |
| Priority (line 41) | (priority, last_access_time) | Application-set priority levels |
| SLRU (line 49) | (is_protected, last_access_time) | Two-segment: probationary + protected |
Eviction (evict(), at radix_cache.py:582) uses a maintained evictable_leaves set (not full tree traversal) with a min-heap for priority ordering. Only leaf nodes with lock_ref == 0 are candidates. When a leaf is evicted, its parent may become a new leaf candidate if all children are gone. Lock-ref propagation (inc_lock_ref, line 611-624) walks from the matched node up to the root, incrementing lock_ref at each ancestor. This ensures no ancestor of an in-use node can be evicted.
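The priority-tuple-plus-heap pattern is small enough to sketch. This is a simplified illustration (flat leaves, no tree or lock_ref), not SGLang's evict_policy.py:

```python
# Heap-based eviction keyed on pluggable priority tuples (LRU/LFU/FIFO shown).
# Toy sketch of the pattern, not SGLang's actual implementation.

import heapq
from dataclasses import dataclass

@dataclass
class Leaf:
    name: str
    last_access_time: float
    hit_count: int
    creation_time: float

# Each policy maps a leaf to a sortable key; heapq evicts the minimum first.
POLICIES = {
    "lru":  lambda n: (n.last_access_time,),
    "lfu":  lambda n: (n.hit_count, n.last_access_time),
    "fifo": lambda n: (n.creation_time,),
}

def evict_order(leaves, policy):
    heap = [(POLICIES[policy](n), n.name) for n in leaves]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

leaves = [
    Leaf("system_prompt", last_access_time=9.0, hit_count=50, creation_time=0.0),
    Leaf("one_off_chat",  last_access_time=8.0, hit_count=1,  creation_time=5.0),
    Leaf("recent_burst",  last_access_time=1.0, hit_count=3,  creation_time=2.0),
]

print(evict_order(leaves, "lru"))   # oldest access evicted first
print(evict_order(leaves, "lfu"))   # fewest hits evicted first
```

Note how LFU protects the hot system prompt (50 hits) even though FIFO would evict it first -- the workload-specific tuning the table above alludes to.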
Comparison
| Dimension | vLLM | SGLang |
| --- | --- | --- |
| Data structure | Flat hash table | Radix tree (trie) |
| Lookup complexity | O(1) amortized | O(k/page_size) where k = prefix length |
| Prefix sharing granularity | Block-aligned only | Arbitrary (mid-node split) |
| In-batch sharing | Via ref_cnt on shared blocks | Implicit -- multiple requests traverse same tree path |
| Per-entry overhead | — | ~300 bytes (TreeNode with children dict, metadata) |
| Hash function | SHA256-CBOR or xxHash-128 | Position-aware chained hashing |
When vLLM wins: High-throughput serving with uniform request patterns where block-aligned caching suffices and the O(1) lookup matters at scale. The lower per-block memory overhead also helps when block counts exceed millions.
When SGLang wins: Workloads with diverse, overlapping prefixes (multi-turn chat, few-shot prompting with varied examples) where arbitrary-boundary prefix sharing yields materially higher cache hit rates. The pluggable eviction strategies help when workload access patterns are non-LRU.
Quiz: Test Your Understanding
Q1 (Calculation). A Llama-70B model uses GQA with 8 KV heads, head_dim=128, 80 layers, FP16. The serving GPU has 80 GiB HBM. Model weights under TP=2 consume 35 GiB per GPU. Block size = 16. (a) What is the per-token KV size? (b) Per-block KV size? (c) How many blocks fit in the remaining HBM? (d) Maximum tokens of KV capacity?
(a) Per token: 2 (K+V) * 8 heads * 128 * 2 bytes * 80 layers = 320 KiB.
(b) Per block: 16 * 320 KiB = 5 MiB.
(c) Remaining HBM: 80 - 35 = 45 GiB. Blocks: 45 * 1024 / 5 = 9216.
(d) 9216 * 16 = 147,456 tokens.
Q2 (Concept). Why does content-only hashing (hashing just the token IDs in a block) break correctness with ALiBi or other position-dependent encodings, while it works for RoPE applied at attention time?
With RoPE, positional encoding is applied to Q and K at attention computation time using the absolute position, so the cached K/V values are position-independent -- the same token IDs at different absolute positions produce identical K/V. Content-only hashing is safe because matching content implies matching K/V.
With ALiBi, the bias is a function of the relative distance between query and key positions. The K/V values themselves are position-independent, BUT if a system encodes absolute position into K/V (as some ALiBi variants or learned positional embeddings do), then the same token IDs at different positions produce different K/V. Content-only hashing would match blocks that have different K/V values, corrupting attention output. You would need position-aware hashing (chaining with parent hash or including absolute position in the hash input).
Q3 (Design). SGLang's radix tree splits nodes when a prefix match ends mid-edge. Explain why a hash table cannot support this operation, and what the concrete cache-miss cost is for a workload where 80% of requests share a 100-token prefix but block_size=16.
A hash table keys on complete block contents. If two requests diverge at token 100, the first 6 complete blocks (tokens 0-95) are cacheable. Tokens 96-99 form a partial 7th block that cannot be hashed or shared. The hash table has no mechanism to represent "the first 4 tokens of this block are shared." Each new request must recompute tokens 96-99, wasting 4 tokens of KV compute per request.
In a radix tree, the node covering tokens 96-111 would be split at position 100: a new node for tokens 96-99 (shared) and a child for tokens 100-111 (divergent). The shared 4-token prefix is reused. Over 1000 requests, the hash table wastes 1000 * 4 * 320 KiB = 1.25 GiB of redundant KV cache writes (and the compute that produced them); the radix tree wastes zero.
Q4 (Concept). In vLLM's FreeKVCacheBlockQueue, why are sentinel (fake) head and tail nodes used instead of null checks?
Sentinel nodes eliminate branch instructions in every linked-list operation. Without sentinels, popleft, append, and remove each need to check whether the node is the head or tail and handle those cases separately (4-6 branches per operation). With sentinels, every real node always has valid prev_free_block and next_free_block pointers, so the pointer-swizzling code is branchless. On the critical scheduling path executed thousands of times per second, eliminating these branches measurably reduces Python overhead. This is a classic systems optimization borrowed from Linux kernel linked-list implementations.
7.
Distributed Execution Layer
First Principles
[Interactive figure: tensor-parallel column split -- columns of the weight matrix are sharded across ranks, with an all-reduce after the output projection.]
A single GPU cannot hold the weights and KV cache of models beyond ~13B parameters at FP16 (26 GiB weights alone). Even for models that fit, the compute of a single GPU may be insufficient to meet latency targets. Distributed execution partitions computation across GPUs. The three axes of parallelism -- Tensor (TP), Pipeline (PP), and Expert (EP) -- address different bottlenecks.
Design trade-offs:
- DP-Attention reduces KV memory but adds an all-reduce per attention layer.
- Communication backend: NCCL all-reduce vs custom dispatchers (DeepEP, NixlEP) -- custom kernels overlap compute and communication but increase complexity.
Tensor Parallelism (TP)
Column-parallel linear. Weight matrix W of shape [d_in, d_out] is split along the output dimension. Rank r receives W[:, r*d_out/T : (r+1)*d_out/T]. The forward pass computes Y_r = X @ W_r locally with no communication. The output Y_r has shape [B, d_out/T]. For layers that need the full output (e.g., before a row-parallel layer), an all-gather collects [Y_0, ..., Y_{T-1}].
Row-parallel linear. W is split along the input dimension. Rank r receives W[r*d_in/T : (r+1)*d_in/T, :]. Input X must be split (or all-gathered) so each rank sees its slice. Each rank computes a partial sum Y_r = X_r @ W_r. An all-reduce sums the partial results: Y = sum(Y_r).
MHA head partitioning. For H attention heads with TP degree T, rank r gets heads [rH/T, (r+1)H/T). Each rank computes attention independently on its head subset. QKV projections use column-parallel; output projection uses row-parallel. The all-reduce after the output projection is the sole communication per attention layer.
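The column-/row-parallel algebra above can be checked numerically without any GPUs. A pure-Python sketch with a toy 4x4 weight and TP=2 ("all-reduce" is just an elementwise add of the partial products):

```python
# Tiny numerical check of tensor-parallel sharding: column-parallel shards
# concatenate to the full output; row-parallel partial sums add to it.

def matmul(X, W):
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

X = [[1.0, 2.0, 3.0, 4.0]]                                       # [B=1, d_in=4]
W = [[(i + 1) * (j + 1.0) for j in range(4)] for i in range(4)]  # [4, 4]
T = 2                                                            # TP degree

# Column-parallel: rank r holds W[:, r*2:(r+1)*2]; outputs concatenate.
col_shards = [[row[r*2:(r+1)*2] for row in W] for r in range(T)]
col_out = [sum((matmul(X, Wr)[0] for Wr in col_shards), [])]

# Row-parallel: rank r holds W[r*2:(r+1)*2, :] plus X's matching slice;
# partial products sum (the all-reduce) to the full result.
row_shards = [W[r*2:(r+1)*2] for r in range(T)]
x_shards = [[X[0][r*2:(r+1)*2]] for r in range(T)]
partials = [matmul(x_shards[r], row_shards[r]) for r in range(T)]
row_out = [[a + b for a, b in zip(partials[0][0], partials[1][0])]]

assert col_out == matmul(X, W)
assert row_out == matmul(X, W)
print("column-parallel concat and row-parallel all-reduce both match")
```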
Communication volume. Each all-reduce on a tensor of size M bytes using ring all-reduce transfers 2 * (T-1)/T * M bytes total across the ring. For hidden dimension d=8192, batch tokens B=4096, FP16:
M = B * d * 2 bytes = 4096 * 8192 * 2 = 64 MiB
Ring all-reduce volume = 2 * (T-1)/T * 64 MiB
For T=8: 2 * 7/8 * 64 = 112 MiB per all-reduce
With 80 layers and 2 all-reduces per layer (attention + FFN): 160 * 112 MiB = 17,920 MiB = 17.5 GiB of all-reduce traffic per forward pass
At NVLink bandwidth of 900 GB/s bidirectional (H100), the per-layer all-reduce time for 112 MiB is 112 MiB / 900 GB/s = ~0.12 ms. With 2 all-reduces per layer (attention + FFN) across 80 layers, the total all-reduce time per forward pass is 0.12 ms x 80 x 2 = ~19 ms. This is acceptable for prefill (where compute takes hundreds of ms) but significant for decode where the total compute time may be only 2-5 ms per layer (~40 ms total across 80 layers at batch=256).
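The traffic/time arithmetic above in script form (exact binary MiB arithmetic lands slightly above the hand-rounded ~19 ms figure):

```python
# Ring all-reduce traffic and time for hidden=8192, 4096 batch tokens, FP16,
# T=8, 900 GB/s NVLink, 80 layers with 2 all-reduces per layer.

def ring_allreduce_bytes(msg_bytes: float, t: int) -> float:
    # Each of the 2*(T-1) ring steps moves 1/T of the message per rank.
    return 2 * (t - 1) / t * msg_bytes

MIB = 2**20
msg = 4096 * 8192 * 2                    # 64 MiB activation tensor
per_ar = ring_allreduce_bytes(msg, 8)    # 112 MiB on the wire
per_ar_ms = per_ar / 900e9 * 1e3         # ~0.13 ms at 900 GB/s

total_ms = per_ar_ms * 80 * 2            # 2 all-reduces/layer x 80 layers
print(f"{per_ar / MIB:.0f} MiB per all-reduce, ~{total_ms:.0f} ms per forward pass")
```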
Pipeline Parallelism (PP)
Layers are partitioned into P stages across P GPUs. Stage s holds layers [sL/P, (s+1)L/P). The forward pass sends activations from stage s to stage s+1.
Micro-batching. Without micro-batching, pipeline bubble fraction = (P-1)/P (only one stage active at a time). With M micro-batches, bubble fraction drops to:
bubble = (P-1) / (P-1 + M)
For P=4, M=8: bubble = 3/11 = 27%. For P=4, M=32: bubble = 3/35 = 8.6%. In practice, PP is used primarily to scale beyond a single node's NVLink domain, accepting the latency cost for memory capacity.
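The bubble formula above, scripted for a quick sweep over micro-batch counts:

```python
# Pipeline bubble fraction: bubble = (P-1) / (P-1 + M).

def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (p - 1 + m)

for m in (1, 8, 32):
    print(f"P=4, M={m:>2}: bubble = {bubble_fraction(4, m):.1%}")
```

At M=1 the formula reduces to (P-1)/P, the no-micro-batching case.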
Expert Parallelism (EP) for MoE
MoE models (DeepSeek-V3, Mixtral) replace dense FFN layers with a router that selects top-k experts per token. With E experts and EP degree T, each GPU holds E/T experts.
All-to-all dispatch. Each GPU routes its tokens to the GPU hosting the selected expert. This is a permutation communication pattern. The key challenge is load imbalance: if expert popularity is skewed, some GPUs receive disproportionately many tokens.
SGLang provides a rich set of dispatchers under sglang/srt/layers/moe/token_dispatcher/:
StandardDispatcher (standard.py:86): baseline all-to-all using NCCL
DeepEPDispatcher (deepep.py:731): uses DeepEP library for overlapped dispatch with normal and low-latency modes
MoriEPDispatcher (moriep.py:900): optimized EP dispatcher with RDMA-aware routing
NixlEPDispatcher (nixl.py:350): NIXL-based EP for cross-node dispatch
MooncakeEPDispatcher (mooncake.py:286): Mooncake transfer engine integration
FlashinferDispatcher (flashinfer.py:72): FlashInfer-based EP for single-node
Expert capacity factor and load balancing. In standard MoE implementations, each expert has a fixed capacity: the maximum number of tokens it can process per batch is capacity_factor * (total_tokens / num_experts). A typical capacity factor is 1.25, allowing 25% headroom above perfectly uniform routing. When an expert receives more tokens than its capacity, the excess tokens are *dropped* -- routed to a no-op that outputs zeros. Token dropping loses information and degrades output quality.
To mitigate this, an auxiliary load-balancing loss is added during training, penalizing routing distributions that deviate from uniform. Without this loss, a small number of experts become disproportionately "popular" while most remain underutilized, creating both quality degradation (from dropping) and compute waste (idle experts).
DeepSeek-V3 takes a different approach: it eliminates token dropping entirely, using dynamic allocation where experts process all routed tokens regardless of capacity. This avoids information loss but introduces variable-size all-to-all communication patterns, since the message sizes between GPU pairs depend on the runtime routing decisions rather than being fixed at capacity. The variable communication complicates NCCL collective scheduling and motivates custom dispatchers like DeepEP.
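A toy routing histogram makes the capacity/dropping mechanics concrete (illustrative only; no framework's actual code):

```python
# Capacity-factor token dropping: each expert accepts at most
# capacity_factor * (tokens / num_experts) tokens; overflow is dropped.

def dropped_tokens(routed: list[int], capacity_factor: float) -> int:
    total, e = sum(routed), len(routed)
    capacity = int(capacity_factor * total / e)
    return sum(max(0, n - capacity) for n in routed)

# 8 experts, 800 tokens: uniform routing -> nothing dropped.
print(dropped_tokens([100] * 8, 1.25))

# Skewed routing (same 800 tokens): two "popular" experts exceed capacity=125.
skewed = [300, 200, 100, 80, 60, 30, 20, 10]
print(dropped_tokens(skewed, 1.25))   # (300-125) + (200-125) tokens dropped
```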
NCCL Tuning and Communication Optimization
NCCL algorithm selection. NCCL supports three collective algorithms, each optimal for different message sizes and topologies:
Ring: data circulates around a ring of GPUs. Bandwidth-optimal (achieves full bisection bandwidth) but latency grows linearly with GPU count. Best for small-to-medium messages on a single node.
Tree: binary tree reduction/broadcast. Latency grows logarithmically with GPU count, but bandwidth utilization is lower. Best for large messages on multi-node clusters where ring latency is prohibitive.
CollNet (SHARP): Scalable Hierarchical Aggregation and Reduction Protocol. Offloads reduction operations to InfiniBand switches themselves, so data is partially reduced in-network before reaching destination GPUs. This halves cross-node traffic for all-reduce (each direction carries half the data since reduction happens in the switch).
These are controlled via NCCL_ALGO=Ring|Tree|CollNet and NCCL_PROTO=Simple|LL|LL128 (protocol: Simple for large messages, LL/LL128 for low-latency small messages). In practice, NCCL auto-selects based on message size and topology, but manual tuning can yield 10-30% improvement for specific workload patterns.
GPU Direct RDMA (GDR). Standard GPU-to-GPU communication across nodes requires: GPU -> CPU memory (PCIe) -> NIC -> network -> NIC -> CPU memory (PCIe) -> GPU. GDR eliminates the CPU staging: the NIC reads from and writes to GPU memory directly via PCIe peer-to-peer access. This requires the NIC and GPU to be on the same PCIe switch (or root complex with ACS disabled). GDR reduces cross-node latency by 2-5 us and is critical for disaggregated prefill/decode architectures where KV cache transfers between prefill and decode GPUs must be low-latency. Enable with NCCL_NET_GDR_LEVEL=SYS (or finer: PIX for same PCIe switch, PXB for same PCIe bus, PHB for same NUMA node).
DP-Attention (SGLang Innovation)
Standard data parallelism replicates the full model and splits the batch. Each DP rank stores KV cache for its batch slice. For long-context workloads, the KV cache is the memory bottleneck, not model weights.
Insight. Instead of each DP rank storing KV for all attention heads (of its batch slice), split the heads across DP ranks. Rank r stores KV for heads [rH/D, (r+1)H/D) but for ALL requests in the batch. This reduces per-GPU KV memory by a factor of D (the DP-Attention degree).
Implementation (sglang/srt/layers/dp_attention.py). compute_dp_attention_world_info (line 237-251) computes the attention TP rank and DP rank from the global TP rank.
The rank layout is (dp, cp, tp) where tp is the fastest-changing dimension. DpPaddingMode controls how variable-length sequences are padded: MAX_LEN pads all sequences to the longest in the batch (simpler but wasteful), SUM_LEN pads to the sum (better utilization).
Trade-off. DP-Attention reduces KV memory linearly with D but adds an all-reduce after every attention layer (to recombine head outputs). For decode-heavy workloads with long contexts (>32K tokens), the KV memory savings dominate. For short-context, high-throughput workloads, the added communication is a net loss.
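The memory side of the trade-off is simple arithmetic. A sketch using the 320 KiB-per-token figure from earlier sections (the batch/context numbers here are illustrative, not from the text):

```python
# DP-Attention memory trade-off: per-GPU KV footprint drops by the DP degree
# D, at the cost of one extra all-reduce per attention layer.

KV_PER_TOKEN = 320 * 1024   # bytes across all 80 layers (Llama-70B GQA, FP16)

def kv_per_gpu_gib(batch: int, seq_len: int, d: int) -> float:
    """KV footprint per GPU with heads split across D DP-Attention ranks."""
    return batch * seq_len * KV_PER_TOKEN / d / 2**30

# Example long-context load: 32 requests at 64K tokens each.
print(f"D=1: {kv_per_gpu_gib(32, 65536, 1):.0f} GiB")   # exceeds any single GPU
print(f"D=8: {kv_per_gpu_gib(32, 65536, 8):.0f} GiB")   # fits an 80 GiB H100 budget
```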
Scalability considerations. TP scales within a node (typically TP<=8 due to the NVLink domain). Beyond one node, PP and DP add scale: TP=8, PP=4, DP=4 serves 128 GPUs. The communication-to-computation ratio grows with scale: at TP=8, ~19 ms of all-reduce per step vs ~40 ms of compute (for decode at batch=256). Communication is ~32% of step time. At 1000+ GPUs, fault tolerance becomes critical -- neither framework supports mid-generation recovery from GPU failures. A TP rank failure is fatal to the entire TP group.
| Aspect | vLLM | SGLang |
| --- | --- | --- |
| DP-Attention | — | First-class support with configurable padding modes |
| Communication overlap | Via CUDA graph and async ops | DeepEP/MoriEP overlap compute and communication |
Quiz: Test Your Understanding
Q1 (Symbolic). For a weight matrix W of shape [8192, 8192] with TP=4 using column-parallel partitioning, what slice does rank 2 receive?
Column-parallel splits along the output dimension (dim 1). Each rank gets 8192/4 = 2048 columns. Rank 2 receives W[:, 4096:6144], i.e., all 8192 rows but columns 4096 through 6143.
Q2 (Symbolic). A model has 32 attention heads. With TP=4, which heads does each rank compute?
32 heads / 4 ranks = 8 heads per rank.
Rank 0: heads [0, 8)
Rank 1: heads [8, 16)
Rank 2: heads [16, 24)
Rank 3: heads [24, 32)
Q3 (Calculation). Calculate the all-reduce communication volume for one transformer layer with TP=8, hidden_dim=8192, batch_tokens=4096, FP16. How many such all-reduces occur per layer?
Ring all-reduce volume = 2 * (T-1)/T * M = 2 * 7/8 * 64 = 112 MiB.
Per transformer layer there are typically 2 all-reduces: one after the attention output projection (row-parallel) and one after the FFN second linear (row-parallel). Total per layer = 2 * 112 = 224 MiB.
Q4 (Concept). Why does DP-Attention become increasingly beneficial as context length grows, but harmful for short-context workloads?
KV cache memory scales as O(batch_size * seq_len * num_heads * head_dim). For long contexts, KV memory dominates HBM -- splitting heads across DP ranks directly reduces the per-GPU KV footprint, allowing more concurrent requests or longer sequences.
The cost is one all-reduce per attention layer per forward pass. For short contexts, the attention computation is fast (small seq_len), so the all-reduce latency becomes a significant fraction of total layer time. The KV memory savings are also small (short sequences use little KV), so the trade-off is net negative. The crossover point depends on hardware bandwidth, but typically DP-Attention wins when seq_len exceeds ~8K-16K tokens.
8.
Model Runtime / Executor
First Principles
The runtime layer bridges the model graph and the GPU hardware. Its two critical responsibilities are minimizing kernel launch overhead and selecting the fastest attention backend.
CUDA Graphs from First Principles
Every CUDA kernel launch traverses the CUDA driver stack: argument validation, kernel selection, stream synchronization check, and PCIe/NVLink doorbell. This costs approximately 3-10 microseconds per launch. A single decode step for a 70B model executes 100-300 kernels (attention, layernorm, linear, activation, rotary embedding -- per layer times 80 layers, plus sampling). Total launch overhead:
300 kernels * 5 us/launch = 1,500 us = 1.5 ms
A decode step's total GPU compute time is 2-5 ms for moderate batch sizes. Launch overhead is 30-75% of total time -- the GPU is idle waiting for the CPU to issue the next kernel.
Solution: CUDA Graphs. Record a sequence of kernel launches into a graph object. Replay the entire graph with a single launch call (15-50 us overhead depending on graph complexity -- node count, dependency chains -- regardless of kernel count). Savings:
Before: 300 * 5 = 1,500 us
After: 1 * 30 = 30 us (typical for moderate-complexity graphs)
Savings: 1,470 us per decode step = 1.47 ms
At 500 tokens/second decode rate, that is 0.735 seconds saved per second -- a 2.7-3x throughput improvement for decode-bound workloads.
The catch: CUDA graphs require fixed tensor addresses, shapes, and execution order. Variable batch sizes (the norm in serving) require either (a) capturing multiple graphs for different batch sizes, (b) padding to the nearest captured size, or (c) breaking out of the graph for dynamic sections.
vLLM uses CUDAGraphMode with eager (no graph), static (fully captured), and hybrid modes. It also leverages torch.compile with Inductor fusion passes (vllm/compilation/passes/) that fuse small kernels (e.g., RMS normalization + quantization, attention + quantization) to reduce total kernel count before graph capture.
SGLang implements breakable CUDA graphs: the graph can be exited mid-execution when a dynamic-shape operation is encountered, run that operation eagerly, then re-enter the graph. Piecewise capture subdivides the model into segments that are individually graphed. This achieves most of the graph benefit without requiring fully static shapes.
Attention Backends
The attention computation softmax(QK^T / sqrt(d)) V has O(N^2) memory for the attention matrix if materialized. FlashAttention avoids this by tiling:
FlashAttention-2: Tiles Q and KV into blocks that fit in SRAM. Computes partial softmax per tile, maintains a running max for numerical stability. Memory: O(N) instead of O(N^2). Throughput: near peak FLOPS on A100/H100 for long sequences.
FlashInfer: Optimized for the variable-length batched decode pattern common in serving. Its PageAttention kernel directly indexes into paged KV cache blocks, avoiding the scatter/gather overhead of converting paged layout to contiguous before calling FlashAttention.
vLLM: The attention selector uses a registry-based dispatch pattern (vllm/v1/attention/backends/registry.py), not direct compute-capability branching. Platform detection via cuda.py feeds into the registry, but the selector itself is configuration-driven. The configuration is captured in AttentionSelectorConfig (selector.py:21-47), which considers head size, dtype, kv_cache_dtype, MLA usage, sparse attention, and sink tokens. The registry maps these configurations to backends: on Hopper (H100), FlashInfer is preferred for its paged-decode kernels; on Ampere, FlashAttention-2 is the fallback.
SGLang supports 15+ attention backends including FlashInfer (default on CUDA), FlashAttention, Triton kernels, and domain-specific backends like NSA (Native Sparse Attention) for models with learned sparsity patterns. The backend is selected at server startup based on hardware and model architecture.
FlashAttention vs FlashInfer paged KV handling. FlashAttention requires gathering paged KV blocks into a contiguous buffer before attention (or using block-diagonal masking to simulate contiguity). FlashInfer's BatchDecodeWithPagedKVCacheWrapper natively iterates the page table within the kernel, issuing loads directly from scattered physical pages. This avoids an O(batch * seq_len * head_dim) gather copy, which is significant at high batch sizes -- for batch=256 with 4K context on 128-dim heads, the gather alone is 256 * 4096 * 128 * 2 bytes = 256 MiB of unnecessary HBM traffic per attention layer.
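The gather-traffic figure can be reproduced with a one-line model (an illustrative helper, not an engine API):

```python
def paged_kv_gather_bytes(batch, seq_len, head_dim, dtype_bytes=2):
    """Extra HBM traffic per attention layer to gather scattered KV pages
    into a contiguous buffer before a non-paged attention kernel can run."""
    return batch * seq_len * head_dim * dtype_bytes

extra = paged_kv_gather_bytes(256, 4096, 128)   # matches the 256 MiB figure above
```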
Hopper TMA (Tensor Memory Accelerator). Hopper introduces hardware-accelerated asynchronous bulk copies from global memory to shared memory without occupying CUDA cores. The copy and compute stages are hardware-decoupled: the TMA unit operates independently of the warp scheduler. FlashAttention-3 leverages TMA to achieve higher occupancy: while Tensor Cores compute attention on one tile, TMA prefetches the next KV tile into shared memory. This producer-consumer pipeline is why FA-3 on Hopper outperforms FA-2 significantly -- the Tensor Cores never stall waiting for data, and the memory system is continuously utilized. On Ampere (no TMA), FA-2 must use software-managed shared memory loads, where warps alternate between issuing loads and computing, leaving either the memory system or Tensor Cores idle in each phase.
Key Kernel Fusions for Inference Throughput
The most impactful kernel fusions for LLM serving eliminate HBM round-trips between operations that are naturally sequential:
Fused RMSNorm + residual add + quantize: The baseline executes three kernels: (a) RMSNorm reads hidden states from HBM and writes normalized output, (b) residual add reads normalized output and residual, writes sum, (c) quantize reads FP16 result, writes INT8/FP8. The fused kernel reads hidden states and residual once, computes norm + add + quantize in registers, and writes the quantized output once -- eliminating two full HBM round-trips. In vLLM, this is handled by Inductor fusion passes (rms_quant_fusion.py). In SGLang, the fusion is implemented in the model forward pass.
Fused SwiGLU gate * up: The MLP's gating mechanism computes SiLU(x @ W_gate) * (x @ W_up). Without fusion, both gate and up projections are materialized to HBM as separate tensors, then an elementwise kernel reads both and writes the product. The fused kernel performs the elementwise multiply in registers immediately after the matmuls, avoiding materializing two full [batch, intermediate_dim] tensors.
Fused RoPE in attention: FlashInfer fuses rotary positional embedding directly into the attention kernel -- Q and K are rotated in registers as they are loaded, before the dot product. FlashAttention-2 requires a separate RoPE kernel that reads Q/K from HBM, applies rotation, and writes back, before the attention kernel reads them again. For long sequences, this extra read-write pass is non-trivial.
Fused top-k/top-p sampling: SGLang's sgl-kernel fuses the sort + cumulative sum + threshold comparison into a single pass over the logit distribution. vLLM uses separate PyTorch ops (sort, cumsum, masking, multinomial), each launching a separate kernel with its own HBM traffic.
Sampling Pipeline
After the final layer produces logits of shape [batch, vocab_size], the sampling pipeline converts them to token IDs:
vLLM (vllm/v1/sample/sampler.py:21-60):
Optional raw logprobs computation (for API response)
Sample via TopKTopPSampler: temperature scaling, min_p, top_k, top_p, multinomial sample
Gather top-k logprobs for response
SGLang (sglang/srt/layers/sampler.py:41-100):
Custom logit processors + NaN detection
Greedy fast-path: if all requests are greedy, torch.argmax directly
Temperature scaling
Top-k renormalization via sgl_kernel.top_k_renorm_prob (AOT CUDA kernel)
Top-p renormalization via sgl_kernel.top_p_renorm_prob
Sampling via FlashInfer's top_k_top_p_sampling_from_probs or min_p_sampling_from_probs
SGLang's use of sgl-kernel for renormalization and FlashInfer for sampling avoids Python-side probability manipulation, keeping the hot path in fused CUDA kernels.
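An unfused NumPy reference of the top-k/top-p pipeline makes the kernel count concrete -- each commented step is a separate kernel launch in the naive approach that fused samplers collapse into a single pass. This is a sketch, not either engine's actual code:

```python
import numpy as np

def top_k_top_p_sample(logits, k, p, temperature=1.0, seed=0):
    """Reference (unfused) top-k/top-p sampling over a single logit vector."""
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]          # sort           (kernel 1)
    probs_sorted = probs[order]
    probs_sorted[k:] = 0.0                   # top-k mask     (kernel 2)
    csum = np.cumsum(probs_sorted)           # cumulative sum (kernel 3)
    cutoff = np.searchsorted(csum, p * probs_sorted.sum()) + 1
    probs_sorted[cutoff:] = 0.0              # top-p mask     (kernel 4)
    probs_sorted /= probs_sorted.sum()       # renormalize
    rng = np.random.default_rng(seed)
    return order[rng.choice(len(probs_sorted), p=probs_sorted)]  # multinomial (kernel 5)
```

Each intermediate (sorted probs, cumulative sums, masks) is a full pass over the vocabulary-sized tensor; a fused kernel keeps all of it in registers/shared memory.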
Comparison
Dimension
vLLM
SGLang
CUDA graphs
Static capture, torch.compile fusion
Breakable/piecewise graphs
Attention backends
Platform-aware auto-selection
15+ backends, manual or auto
Sampling kernels
PyTorch + custom TopKTopP
sgl-kernel AOT + FlashInfer
Kernel fusion
Inductor IR passes (compilation/passes/)
JIT kernel library (jit_kernel/)
Greedy fast-path
Part of unified sampler
Separate torch.argmax branch
Quiz: Test Your Understanding
Q1 (Calculation). A decode step has 250 kernel launches at 5 us each. With CUDA graphs (30 us typical replay overhead for a moderate-complexity graph), what is the absolute time saved? If total decode step time was 4 ms, what is the relative speedup?
Without graphs: 250 * 5 = 1,250 us launch overhead. With graphs: 30 us. Savings: 1,220 us.
Total time without: 4,000 us. Total time with: 4,000 - 1,250 + 30 = 2,780 us.
Speedup: 4,000 / 2,780 = 1.44x (44% faster).
Q2 (Concept). Why can CUDA graphs not handle variable batch sizes, and what are the two workarounds?
CUDA graphs record the exact kernel arguments (tensor pointers, shapes, strides) at capture time. When batch size changes, tensor shapes change, which means different memory addresses, different grid dimensions, and potentially different kernel selections. Replaying the graph with wrong shapes produces incorrect results or crashes.
Workaround 1: Capture multiple graphs for discrete batch sizes (e.g., powers of 2: 1, 2, 4, 8, ..., 256). Pad the actual batch to the nearest captured size. Cost: memory for multiple graphs + compute waste from padding.
Workaround 2: Breakable/piecewise graphs (SGLang's approach). Capture graph segments that have fixed shapes (e.g., per-layer compute with fixed max-batch padding) and exit the graph for dynamic operations (e.g., variable-length attention). Re-enter the graph afterward.
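Workaround 1's padding policy can be sketched in a few lines (hypothetical helper; the capture sizes shown are the usual powers of two):

```python
def nearest_captured_batch(batch, captured=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Round a runtime batch size up to the nearest pre-captured graph size."""
    for size in captured:
        if size >= batch:
            return size
    raise ValueError(f"batch {batch} exceeds largest captured graph")

# A batch of 17 pads to 32: ~47% of the graphed compute is wasted on padding.
padding_waste = 1 - 17 / nearest_captured_batch(17)
```

The trade-off is visible directly: denser capture sets reduce padding waste but cost more capture time and GPU memory for the stored graphs.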
Q3 (Concept). FlashAttention achieves O(N) memory instead of O(N^2) for attention. Where does the N^2 come from in naive attention, and how does tiling eliminate it?
Naive attention materializes the full [N, N] attention score matrix S = Q @ K^T in HBM before applying softmax and multiplying by V. This matrix has N^2 elements, each consuming 2-4 bytes.
FlashAttention tiles Q into blocks of size B_q and KV into blocks of size B_kv. For each Q-block, it iterates over KV-blocks, computing the partial [B_q, B_kv] attention scores in SRAM, accumulating the output and maintaining a running softmax denominator. At no point is the full [N, N] matrix materialized. The only HBM storage is the output [N, d] and the per-row log-sum-exp for the backward pass -- both O(N). The block sizes B_q, B_kv are chosen to fit in SRAM (typically 64-128).
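A minimal NumPy sketch of the streaming-softmax idea (single head, no causal mask, forward pass only): it produces the same output as naive attention while never materializing the [N, N] score matrix:

```python
import numpy as np

def naive_attention(q, k, v):
    """Materializes the full [N, N] score matrix -- O(N^2) memory."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=64):
    """FlashAttention-style streaming softmax: only an [N, block] tile of scores
    exists at a time; a running max m and denominator l give O(N) extra memory."""
    n, d = q.shape
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)   # running row max (numerical stability)
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)            # rescale previous partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=-1)
        out = out * scale[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

The real kernels additionally tile Q, keep the tiles in SRAM, and apply causal masking, but the rescale-and-accumulate recurrence is the same.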
9.
Quantization + Kernel Backend
First Principles
LLM inference is memory-bandwidth-bound during decode: the GPU must read the entire weight matrix for each token generated, but each token only performs one matmul per weight element (arithmetic intensity ~1). Reducing weight precision from FP16 (2 bytes) to INT4 (0.5 bytes) cuts memory bandwidth by 4x, directly translating to ~4x decode throughput for bandwidth-bound workloads.
Why quantization works. Trained weight distributions are approximately Gaussian with most values clustered near zero. The information content per weight is far less than 16 bits. Empirically, 4-bit quantization with group-wise scaling preserves >99% of model quality for most tasks.
INT4 group quantization (group_size=128). Group 128 consecutive weights. Compute scale = max(abs(group)) / 7 (for symmetric 4-bit: range [-7, 7]). Quantize: w_int4 = round(w_fp16 / scale). Dequantize: w_fp16 = w_int4 * scale. Scale factor overhead: 1 FP16 value (2 bytes) per 128 weights = 0.015625 additional bytes per weight.
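The procedure above can be sketched in NumPy (illustrative; real formats add packing of two INT4 values per byte and sometimes zero-points):

```python
import numpy as np

def quant_int4_groups(w, group_size=128):
    """Symmetric 4-bit group quantization: one FP16 scale per group of weights."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # symmetric range [-7, 7]
    q = np.clip(np.round(g / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequant_int4_groups(q, scale):
    """Reconstruct FP32 weights: per-group scale broadcasts over the group."""
    return (q * scale.astype(np.float32)).ravel()
```

The maximum reconstruction error per group is half a quantization step (scale/2), which is why group-wise scales beat a single per-tensor scale: outliers only degrade their own group.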
Model size calculation. 70B parameters at FP16 = 70B * 2 = 140 GiB. At INT4 with group_size=128:
Weight bytes = 70B * 0.5 bytes = 35 GiB
Scale bytes = (70B / 128) * 2 bytes = 1.09 GiB
Total = 36.09 GiB (fits in a single 80 GiB GPU)
AWQ (Activation-aware Weight Quantization): Observes that a small fraction of weight channels are "salient" -- they correspond to large activation magnitudes and disproportionately affect output quality. AWQ protects these channels by scaling them up before quantization (equivalently, absorbing the scale into preceding layers), then using standard group quantization. This preserves the critical channels at the cost of slightly worse quantization for non-salient channels.
GPTQ: Uses second-order (Hessian) information to quantize weights one column at a time, redistributing quantization error to not-yet-quantized columns. This is optimal in the L2 sense for each column given the Hessian of the layer's reconstruction loss. More expensive to calibrate than AWQ but often yields better perplexity at the same bit width.
Marlin kernels: Hand-optimized GPU kernels that fuse INT4 dequantization with matrix multiplication. Instead of dequantizing weights to FP16 in a separate kernel and then calling cuBLAS, Marlin reads INT4 weights, dequantizes in registers, and immediately accumulates in FP16/FP32. This eliminates the intermediate FP16 weight tensor from HBM, achieving near-peak memory bandwidth utilization.
Tensor Core data type requirements and hardware generations. Not all quantization formats map directly to Tensor Core operations:
INT4 weight-only quantization (AWQ, GPTQ): Weights are stored as INT4 in HBM, but computation happens in FP16 or FP32 on Tensor Cores after software dequantization. The Tensor Core itself does not perform INT4 matrix multiply -- the INT4 format saves memory bandwidth (reading 0.5 bytes instead of 2 bytes per weight), but the arithmetic throughput is identical to FP16. The speedup comes entirely from being memory-bandwidth-bound: 4x less data to read = up to 4x faster.
FP8 (E4M3 for forward, E5M2 for gradients): Requires Hopper's 4th-generation Tensor Cores, which natively support FP8 matrix multiply-accumulate. On Hopper, FP8 achieves 2x the FLOPS of FP16 (3,958 TFLOPS vs 1,979 TFLOPS on H100 SXM). FP8 Tensor Cores are *not available on Ampere* -- on A100, FP8 weights must be dequantized to FP16 before hitting Tensor Cores, providing bandwidth savings but no compute advantage.
FP4 (E2M1): Native Tensor Core support exists only on Blackwell (B100/B200), where FP4 achieves 2x the FLOPS of FP8. On Hopper, FP4 requires software dequantization to FP8 before reaching Tensor Cores. This means FP4 on Hopper saves memory bandwidth (half the bytes of FP8) but does *not* provide 2x compute throughput over FP8 -- the Tensor Cores still operate at FP8 speed. The practical benefit on Hopper is fitting larger models in HBM and reducing memory-bandwidth bottlenecks, not increasing arithmetic throughput.
vLLM
vLLM supports 12+ quantization formats under vllm/model_executor/layers/quantization/: AWQ (awq.py), AWQ-Marlin (awq_marlin.py), GPTQ (gptq.py), GPTQ-Marlin (gptq_marlin.py), FP8 (fp8.py), MXFP4 (mxfp4.py), BitsAndBytes (bitsandbytes.py), ModelOpt (modelopt.py), TorchAO (torchao.py), and more. The base_config.py defines the QuantizationConfig interface that all formats implement.
Kernel fusion is handled through Inductor compilation passes (vllm/compilation/passes/). These operate on the torch.compile IR to identify and fuse patterns like RMSNorm+Quantize, Attention+Quantize, and MLA+Attention+Quantize into single kernels, reducing launch overhead and intermediate memory traffic.
SGLang
SGLang takes a two-tier approach:
AOT kernels (sgl-kernel): Performance-critical kernels are implemented as ahead-of-time compiled CUDA/C++ extensions in the sgl-kernel package. These include top_k_renorm_prob, top_p_renorm_prob, and quantization primitives, compiled once at install time for maximum performance.
JIT kernels (sglang/jit_kernel/): A library of lightweight Triton and CUDA kernels compiled at first use. Quantization-related kernels include:
per_token_group_quant_8bit.py: per-token group INT8 quantization
The JIT approach enables hardware-specific tuning (kernel parameters are auto-tuned for the current GPU) without requiring a heavy build infrastructure.
Comparison
Dimension
vLLM
SGLang
Supported formats
12+ (broadest coverage)
Focused set + sgl-kernel custom ops
Kernel strategy
Marlin + Inductor fusion passes
sgl-kernel AOT + Triton JIT
FP8 support
Full (E4M3, E5M2, per-tensor, per-channel)
Full + per-token group quantization
INT4 support
AWQ, GPTQ, AWQ-Marlin, GPTQ-Marlin
AWQ (JIT dequantize), GPTQ-Marlin
MXFP4/NV-FP4
Yes (mxfp4.py)
Yes (nvfp4.py JIT kernel)
Kernel auto-tuning
Via torch.compile Inductor
Triton auto-tuner in JIT kernels
When vLLM wins: When you need to deploy a model in an unusual quantization format (BitsAndBytes, GGUF, TorchAO, etc.). vLLM's breadth of format support is unmatched.
When SGLang wins: When peak kernel performance matters more than format coverage. The sgl-kernel AOT path avoids JIT compilation latency and Python dispatch overhead for the most critical operations.
Quiz: Test Your Understanding
Q1 (Calculation). A 70B-parameter model uses INT4 quantization with group_size=128 and FP16 scales. (a) What is the total model size? (b) What is the effective bits-per-weight including scale overhead? (c) How much HBM is saved compared to FP16?
(a) Weights: 70B * 0.5 bytes = 35 GiB; scales: (70B / 128) * 2 bytes = 1.09 GiB; total = 36.09 GiB.
(b) 4 bits + 16 bits / 128 weights = 4.125 effective bits per weight.
(c) 140 - 36.09 = ~104 GiB saved, a ~3.9x reduction.
Q2 (Concept). Why is per-channel quantization (one scale per output channel) better than per-tensor quantization (one scale for the entire weight matrix)?
Per-tensor quantization uses a single scale = max(abs(W)) / max_int for the entire matrix. If even one outlier weight is large, the scale is large, and all other weights (which are near zero) are quantized with poor resolution -- they all map to 0 or +/-1 in INT4.
Per-channel quantization computes a separate scale for each output channel (row of the weight matrix). Channels with small weight magnitudes get a small scale (high resolution), and channels with outliers get a large scale (lower resolution but only affecting that channel). The quantization error is distributed more evenly. Empirically, per-channel INT8 matches FP16 quality for most models, while per-tensor INT8 shows measurable degradation.
The cost is storing one scale per channel instead of one per tensor -- negligible for typical hidden dimensions (8192 scales * 2 bytes = 16 KiB vs 2 bytes).
Q3 (Concept). Why do Marlin kernels achieve higher throughput than the naive "dequantize to FP16 then cuBLAS GEMM" approach?
The naive approach has two passes over the weight data:
Dequantize kernel: reads INT4 weights from HBM, writes FP16 weights to HBM (doubles memory traffic)
cuBLAS GEMM: reads FP16 weights from HBM for the matmul
Total HBM traffic for weights: 0.5 bytes (INT4 read) + 2 bytes (FP16 write) + 2 bytes (FP16 read) = 4.5 bytes per weight. This is worse than just reading FP16 (2 bytes)!
Marlin fuses dequantization into the GEMM: reads INT4 weights once from HBM (0.5 bytes), dequantizes in registers (free -- register bandwidth is ~100x HBM bandwidth), and immediately accumulates the matmul. Total HBM reads: 0.5 bytes per weight. This achieves the full 4x bandwidth reduction that quantization promises.
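The byte accounting can be written down directly (a toy model, not the Marlin source):

```python
def hbm_bytes_per_weight(fused):
    """HBM traffic per INT4 weight: fused dequant-GEMM (Marlin-style) vs a
    separate dequantize kernel followed by an FP16 GEMM."""
    int4, fp16 = 0.5, 2.0
    if fused:
        return int4                  # single INT4 read; dequantized in registers
    return int4 + fp16 + fp16        # INT4 read + FP16 write + FP16 read back
```

The unfused path moves 9x more weight bytes than the fused one (4.5 vs 0.5), and even 2.25x more than skipping quantization entirely.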
10.
Speculation Layer
First Principles
Autoregressive decoding is fundamentally sequential: generating token $i$ requires the hidden states produced while generating token $i-1$. For a sequence of $N$ tokens, this means $N$ serial forward passes through the entire model. Each pass is memory-bandwidth-bound during decode (batch size per sequence = 1 token), meaning the arithmetic units sit mostly idle while weights stream from HBM. The GPU processes one token but pays the full cost of loading every parameter.
The core insight behind speculative decoding is that verification is cheaper than generation. If a cheap draft model proposes $K$ candidate tokens, the expensive target model can verify all $K$ in a single forward pass because the attention over those $K$ candidates is fully parallelizable (each candidate at position $i$ attends to the prefix plus candidates $0..i-1$). When the draft model has acceptance rate $\alpha$ per token, we get $K$ tokens verified for approximately 2 forward-pass equivalents (one draft, one verify), yielding a speedup approaching $K/2$ in the best case.
Expected accepted tokens (linear chain). With $K$ draft tokens each accepted independently with probability $\alpha$, the number of accepted tokens before the first rejection follows a truncated geometric distribution. The expected count is:
$E[\text{accepted}] = \sum_{i=1}^{K} \alpha^i = \frac{\alpha\,(1 - \alpha^K)}{1 - \alpha}$
Note this is *not* $\alpha/(1-\alpha)$ (the untruncated geometric mean) -- the truncation to $K$ terms matters. For $\alpha=0.8, K=5$: $E = 0.8 \times (1 - 0.8^5)/(1 - 0.8) = 0.8 \times 3.36 \approx 2.69$ tokens per verify step, plus one additional token the target emits every step (the correction sample at the first rejected position, or a bonus sample when all drafts are accepted).
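As a numerical check on the truncated-geometric expectation $\sum_{i=1}^{K} \alpha^i$, a closed form against a Monte Carlo simulation (function names are illustrative):

```python
import random

def expected_accepted(alpha, k):
    """Closed form for the truncated geometric: E = sum_{i=1..K} alpha^i."""
    return sum(alpha ** i for i in range(1, k + 1))

def simulate_accepted(alpha, k, trials=200_000, seed=0):
    """Monte Carlo: count draft tokens accepted before the first rejection."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for i in range(k):
            if rng.random() >= alpha:   # rejection truncates the chain
                total += i
                break
        else:
            total += k                  # all K drafts accepted
    return total / trials
```

For alpha=0.8, K=5 both give about 2.69 accepted drafts per verify step.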
Design Space
Decision
Option A
Option B
Trade-off
Draft source
Separate small model
Target model's own heads (EAGLE/MTP)
Separate model = more flexible, EAGLE = no extra weights to load
Token structure
Linear chain
Tree (branching)
Tree explores more paths but needs custom attention masks
Rejection method
Strict (argmax match)
Probabilistic (distribution-preserving)
Strict = simpler, probabilistic = exact target distribution
Draft-verify overlap
Sequential
Pipelined (V2)
Pipelining hides draft latency but adds implementation complexity
Speculation depth ($K$)
Small (3-5)
Large (8-16)
Larger $K$ = more compute per step, but diminishing returns if $\alpha$ is low
The Draft-Verify Paradigm
Strict rejection. The simplest method: the target model samples independently at each draft position. If its sample matches the draft token, accept; otherwise reject and discard all subsequent draft tokens. This does NOT preserve the target distribution exactly when using stochastic sampling (it biases toward tokens the draft model also predicts), but it is correct for greedy decoding.
Probabilistic rejection sampling (distribution-preserving). For each draft position with draft token $d$:
Compute acceptance probability: $a = \min\!\bigl(1,\; p(d) / q(d)\bigr)$, where $p$ is the target probability and $q$ is the draft probability.
Draw $u \sim \text{Uniform}(0,1)$. If $u < a$, accept $d$.
If rejected, sample a correction token from the residual distribution: $r(x) \propto \max(0,\; p(x) - q(x))$.
This guarantees the output distribution is *identical* to sampling from the target model alone -- there is zero quality degradation. The proof follows from the fact that the marginal probability of emitting any token $x$ is $q(x) \cdot \min(1, p(x)/q(x)) + (1 - \sum_y q(y)\min(1,p(y)/q(y))) \cdot r(x) = p(x)$.
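The identity in the proof can be checked numerically: the helper below (illustrative, not vLLM code) computes the analytic marginal of the accept/residual procedure and confirms it equals $p$ even for a deliberately poor draft distribution:

```python
import numpy as np

def speculative_marginal(p, q):
    """Analytic marginal P(emit x) under draft-q / target-p rejection sampling."""
    accept = np.minimum(1.0, p / q)            # acceptance prob given draft proposed x
    residual = np.maximum(0.0, p - q)
    residual /= residual.sum()                 # correction distribution r(x)
    p_reject = 1.0 - (q * accept).sum()        # prob the draft token is rejected
    return q * accept + p_reject * residual

p = np.array([0.5, 0.3, 0.2])                  # target distribution
q = np.array([0.25, 0.25, 0.5])                # deliberately poor draft
marginal = speculative_marginal(p, q)          # equals p exactly
```

A bad draft only lowers the acceptance rate (and thus the speedup); it never changes the output distribution.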
vLLM implementation. RejectionSampler in vllm/v1/worker/gpu/spec_decode/rejection_sampler.py supports three modes selected by rejection_sample_method:
strict: Triton kernel _strict_rejection_sample_kernel (line 24). One Triton program per request. Iterates draft positions, stores the target's sample, and sets rejected=True on first mismatch. Clean and branchless inside the GPU warp.
probabilistic: Delegates to probabilistic_rejection_sample() which processes target logits and draft logits jointly, computing the acceptance ratio per token with temperature-adjusted distributions.
synthetic: For testing/benchmarking. Uses compute_synthetic_rejection_sampler_params() to derive a base acceptance rate and decay factor, then simulates acceptance without real draft logits.
Implementation note: vLLM has TWO RejectionSampler classes: (1) vllm/v1/sample/rejection_sampler.py:30 (nn.Module, used during normal sampling with spec decode) and (2) vllm/v1/worker/gpu/spec_decode/rejection_sampler.py:100 (worker-level, contains the Triton kernels). The worker-level sampler contains the actual Triton acceptance kernels; the sample-level one wraps it with logprob handling.
Tree-Structured Speculation
A linear chain wastes all tokens after the first rejection. Tree speculation hedges by branching: at each depth level, multiple candidate tokens are proposed. If the top-1 at depth 2 is rejected, the top-2 might still match.
Tree attention masks. Each node in the tree attends only to its ancestors. For a tree with branching factor $[3, 2]$ (3 children at depth 1, 2 grandchildren per child at depth 2), there are $3 + 6 = 9$ candidate positions, but each depth-2 node attends to only 3 prior tokens (root + its depth-1 parent + itself), not all 9.
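The ancestor-mask construction for the $[3, 2]$ example can be sketched as follows (node numbering is illustrative; node 0 stands in for the root/last verified token):

```python
import numpy as np

def ancestor_mask(parents):
    """mask[i, j] is True iff candidate i may attend to j (an ancestor or itself)."""
    n = len(parents)
    mask = np.eye(n, dtype=bool)
    for i in range(n):
        j = parents[i]
        while j is not None:       # walk up to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Branching [3, 2]: node 0 = root, nodes 1-3 = depth-1 drafts, nodes 4-9 = depth-2 drafts.
parents = [None, 0, 0, 0, 1, 1, 2, 2, 3, 3]
mask = ancestor_mask(parents)
```

Each depth-2 row has exactly 3 True entries (root + parent + itself), matching the text; the prompt prefix would be all-True columns prepended to this block.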
vLLM tree implementation. The EagleProposer (in vllm/v1/spec_decode/eagle.py) parses speculative_token_tree into tree_choices -- a list of tuples defining the tree topology (line 275). It precomputes cu_drafts_per_level (cumulative node counts per depth) and child_drafts_per_level (branching factor per level) at lines 278-289. Tree attention is handled by TreeAttentionMetadataBuilder which constructs the causal mask restricting each query to its ancestor path.
EAGLE: Draft from Target Hidden States
EAGLE avoids training or loading a separate draft model entirely. The key observation: the target model's last hidden state at position $i$ already encodes rich information about token $i+1$. An EAGLE draft head takes this hidden state, applies a lightweight transformation (typically 1-2 transformer layers), and predicts the next token.
EAGLE3 extends this by incorporating auxiliary hidden states from intermediate layers, improving prediction quality at the cost of slightly more draft computation.
SGLang's EAGLE implementation is the most mature in the ecosystem:
EAGLEWorker (sglang/srt/speculative/eagle_worker.py): The V1 worker. Extends TpModelWorker, shares the token-to-KV-pool allocator with the target worker (line 115-117). Initializes the draft model with the target's embedding and LM head via get_embed_and_head() (line 158). Supports both EAGLE and EAGLE3 (lines 120-176), with EAGLE3 models sometimes sharing the LM head from the target.
EAGLEWorkerV2 (sglang/srt/speculative/eagle_worker_v2.py): Overlaps draft generation with target verification. While the target verifies batch $t$, the draft model speculatively generates candidates for batch $t+1$ using the predicted hidden states. This pipelining hides the draft model's latency almost entirely.
EAGLEDraftCudaGraphRunner (sglang/srt/speculative/eagle_draft_cuda_graph_runner.py): CUDA graph capture for the draft model's forward pass, eliminating kernel launch overhead that would otherwise dominate the lightweight draft computation.
organize_draft_results() in eagle_utils.py (line 19): Takes per-step score, token, and parent lists, flattens and sorts by top-K scores to select the best num_draft_token - 1 candidates across all tree branches.
build_tree_kernel_efficient() (line 47): Constructs the verification tree mask. Supports three modes via TreeMaskMode: FULL_MASK (dense boolean), QLEN_ONLY (query-length dimension only), and QLEN_ONLY_BITPACKING (packed bits for memory efficiency with large trees).
Multi-Token Prediction (MTP / Parallel Drafting)
MTP adds multiple prediction heads branching off the final transformer layers. Head $i$ predicts the token at position $+i$ relative to the current position. All heads execute in a single forward pass, producing $K$ candidate tokens simultaneously without any autoregressive draft steps.
vLLM's parallel drafting. The SpecDecodeBaseProposer (line 87-96 of eagle.py) sets parallel_drafting=True and computes extra_slots_per_request = num_speculative_tokens (each request needs $K$ additional KV cache slots for the draft positions). The _init_parallel_drafting_params() method (line 317) reads a mask token ID from the draft model config -- this padding token fills positions where the parallel head has no meaningful input, used in models like DFlash that natively support this mode.
Comparison
Dimension
vLLM
SGLang
Primary method
EAGLE + tree attention
EAGLE + tree via sgl_kernel
Rejection sampling
Triton kernels (strict, probabilistic, synthetic)
Custom CUDA verification in build_tree_kernel_efficient
When vLLM wins: Probabilistic rejection sampling for quality-sensitive workloads; MTP/parallel drafting for models that natively support it; broader hardware backend for spec decode (ROCm).
When SGLang wins: EAGLE V2 overlap hides draft latency nearly for free; the bitpacked tree mask scales to larger trees with less memory; tighter integration between draft and target through shared KV pools.
Quiz: Test Your Understanding
Q1 (Calculation). For a linear speculation chain with $K=5$ draft tokens and per-token acceptance rate $\alpha=0.8$, compute the expected number of accepted tokens (excluding the bonus token from target sampling at the rejection point).
$E = \sum_{i=1}^{5} 0.8^i = 0.8 \times (1 - 0.8^5)/(1 - 0.8) \approx 2.69$ accepted draft tokens. Including the one extra token the target emits each step (the correction sample at the first rejected position, or a bonus sample on full acceptance), that is approximately 3.69 tokens per verification step.
Q2 (Concept). In probabilistic rejection sampling, why do we sample from $r(x) \propto \max(0, p(x) - q(x))$ when the draft token is rejected, rather than simply sampling from $p(x)$?
If we sampled from $p(x)$ on rejection, we would double-count tokens where $p(x) > q(x)$: they already had a chance of being accepted in the acceptance step. The residual distribution $r(x) \propto \max(0, p(x) - q(x))$ compensates for the probability mass already "consumed" by the acceptance step. This ensures the marginal distribution over emitted tokens is exactly $p(x)$ -- the target distribution -- regardless of how poor the draft model is.
Q3 (Calculation). A tree speculation setup uses branching factor $[3, 2, 2]$ (3 children at depth 1, 2 at depth 2, 2 at depth 3). How many total candidate tokens are verified? If each edge has independent acceptance probability $\alpha=0.7$, what is the probability that at least one path of depth 3 is fully accepted?
Total candidates: $3 + 3 \times 2 + 3 \times 2 \times 2 = 3 + 6 + 12 = 21$ tokens, all verified in a single forward pass. Each depth-3 path requires 3 consecutive acceptances: $P(\text{one path}) = 0.7^3 = 0.343$. There are $3 \times 2 \times 2 = 12$ such paths. If the paths were independent, $P(\text{no path accepted}) = (1 - 0.343)^{12} = 0.657^{12} \approx 0.0065$, giving $P(\text{at least one}) \approx 0.9935$. However, the paths are NOT independent -- they share prefix edges, and a single rejection near the root eliminates many paths at once, so the true probability is lower than this naive bound.
Exact calculation accounting for shared prefixes. Work bottom-up through the tree. A subtree "fails" if the root edge is rejected OR the root edge is accepted but all child subtrees fail:
Depth-2 subtree (depth-2 node to depth 3). Each depth-2 node has 2 children. A depth-2 subtree fails if the edge to the depth-2 node is rejected, or the edge is accepted but both depth-3 children are rejected:
$P(\text{fail}_2) = (1 - 0.7) + 0.7 \times (1 - 0.7)^2 = 0.3 + 0.7 \times 0.09 = 0.363$
Depth-1 subtree (depth-1 node to depth 3). Each depth-1 node has 2 depth-2 children. A depth-1 subtree fails if the edge to the depth-1 node is rejected, or the edge is accepted but both depth-2 subtrees fail:
$P(\text{fail}_1) = (1 - 0.7) + 0.7 \times 0.363^2 = 0.3 + 0.7 \times 0.1318 \approx 0.392$
Root to all 3 subtrees. The 3 depth-1 subtrees are independent (no shared edges):
$P(\text{all 3 fail}) = 0.392^3 \approx 0.060$.
Corrected answer: $P(\text{at least one depth-3 path fully accepted}) = 1 - 0.060 \approx \mathbf{0.940}$.
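The bottom-up recurrence generalizes to any branching profile (a sketch under the same independent-edge-acceptance assumption):

```python
def p_no_full_path(alpha, branching):
    """P(no root-to-leaf path is fully accepted), computed bottom-up so that
    shared prefix edges are counted once (unlike the naive independence bound)."""
    def subtree_fail(depth):
        # Fail prob of a subtree rooted at a node at `depth`, incoming edge included.
        if depth == len(branching):
            return 1.0 - alpha                 # deepest node: fails iff its edge is rejected
        return (1.0 - alpha) + alpha * subtree_fail(depth + 1) ** branching[depth]
    return subtree_fail(1) ** branching[0]     # depth-1 subtrees share no edges

success = 1 - p_no_full_path(0.7, [3, 2, 2])   # ~0.940, matching the worked answer
```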
Q4 (Design). Explain why EAGLE V2's overlap strategy does not violate the data dependency between draft and verify phases.
V2 overlaps draft generation for batch $t+1$ with target verification of batch $t$. The draft for $t+1$ uses the *predicted* hidden states from the draft model (not the verified hidden states from the target). When verification of batch $t$ completes and some draft tokens are rejected, V2 discards the stale draft candidates for those requests and regenerates them. The overlap is speculative at the draft level as well -- if the hidden state prediction is good (which EAGLE's design optimizes for), most draft work is retained. The target model's output is never affected by the overlap.
11.
Memory / Storage Hierarchy
First Principles
The memory wall dictates everything about LLM serving economics. An H100 has 80 GB of HBM3 at 3.35 TB/s. A 70B parameter model in FP16 occupies 140 GB -- it does not fit on a single GPU. With TP=2, each GPU holds ~70 GB of weights, leaving roughly 10 GB for KV cache, activations, and framework overhead.
KV cache budget calculation. For a model with $L=80$ layers, $H=8$ KV heads (GQA), head dimension $d=128$, FP16 (2 bytes):
Per-token KV: $2 \times L \times H \times d \times 2 = 2 \times 80 \times 8 \times 128 \times 2 = 327{,}680$ bytes $\approx 320$ KB.
With 7 GB available: $7 \times 1024^3 / 327{,}680 \approx 22{,}900$ tokens of KV cache total across all active requests.
At 4K context length, that supports only ~5 concurrent requests. This is why tiered memory matters: moving cold KV entries to CPU or storage can 10x the effective cache capacity.
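The budget arithmetic as a reusable helper (illustrative, matching the figures above):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV footprint: K and V (factor 2), per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(80, 8, 128)     # 327,680 bytes ~= 320 KB
budget_tokens = 7 * 1024**3 // per_token       # tokens that fit in 7 GB of KV budget
concurrent_4k = budget_tokens // 4096          # concurrent requests at 4K context
```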
Design Space
Decision
Option A
Option B
Trade-off
Number of tiers
2 (GPU + CPU)
3+ (GPU + CPU + storage)
More tiers = more capacity but higher complexity + latency for cold hits
Transfer mechanism
Synchronous cudaMemcpy
Async JIT CUDA kernels
Async avoids blocking compute stream
Eviction policy
LRU
Hit-count threshold
Threshold avoids thrashing with bursty access patterns
CPU memory type
Pageable
Pinned (page-locked)
Pinned = faster DMA but reduces host memory available to OS
GPU copy engines (DMA engines) and compute-copy overlap. GPU copy engines are dedicated hardware units separate from compute SMs. They handle memory transfers (host-to-device, device-to-host, device-to-device) independently, meaning the SMs can execute kernels while a copy engine simultaneously transfers data. cudaMemcpyAsync on a non-default stream enables this true compute-copy overlap: the SMs run inference kernels on the compute stream while the copy engine transfers KV cache blocks on a separate transfer stream.
Pinned (page-locked) host memory is required for async DMA transfers because copy engines perform physical DMA and cannot handle virtual memory page faults. If the source/destination is pageable memory, the CUDA driver must first copy it to an internal pinned staging buffer before the DMA engine can transfer it -- effectively doubling the transfer time and serializing the operation. This is why both vLLM's CuMemAllocator (which backs up GPU memory to CPU tensors during sleep mode) and SGLang's HostKVCache use pin_memory=True for their CPU-side buffers: it ensures that KV cache offloading and loading can overlap with GPU compute without stalling either the SMs or the copy engine.
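The cost of the staging copy is easy to model. A first-order sketch with assumed bandwidth numbers (PCIe ~32 GB/s effective DMA, host memcpy ~25 GB/s — illustrative, not measured):

```python
def transfer_time_us(nbytes: int, dma_gbps: float, pinned: bool,
                     host_copy_gbps: float = 25.0) -> float:
    """Host->device transfer time. Pageable memory forces the driver to
    first copy into a pinned staging buffer, serialized before the DMA."""
    dma = nbytes / (dma_gbps * 1e9) * 1e6            # DMA engine time (us)
    if pinned:
        return dma
    staging = nbytes / (host_copy_gbps * 1e9) * 1e6  # CPU memcpy to staging
    return staging + dma

block = 1 << 20  # a 1 MiB batch of KV blocks
fast = transfer_time_us(block, dma_gbps=32.0, pinned=True)   # ~33 us
slow = transfer_time_us(block, dma_gbps=32.0, pinned=False)  # ~75 us
assert slow / fast > 2   # pageable more than doubles the effective time
```

And the pageable path cannot overlap with compute: the staging memcpy occupies a CPU core and serializes the operation, which is the second half of the penalty this model does not even capture.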
vLLM
vLLM implements a two-tier system: GPU HBM and CPU pinned memory.
CuMemAllocator (vllm/device_allocator/cumem.py): A custom CUDA memory allocator built on cuMemCreate/cuMemMap virtual memory APIs. It intercepts PyTorch's allocation path via CUDAPluggableAllocator (line 71), routing through my_malloc/my_free C entry points.
The key feature is sleep/wake mode: when a vLLM instance is idle (e.g., waiting for requests in a multi-tenant deployment), sleep() unmaps GPU memory and backs it up to CPU tensors (AllocationData.cpu_backup_tensor, line 55). wake() re-creates GPU allocations and restores from the CPU backup. This enables GPU memory sharing across multiple model instances without reloading weights from disk.
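In pure-Python terms, the lifecycle looks like this toy model (illustrative only — the real allocator unmaps CUDA virtual-memory ranges rather than moving Python buffers):

```python
class SleepableAllocation:
    """Toy sleep/wake: device storage is dropped on sleep, its contents
    parked in a host-side backup, then restored on wake."""
    def __init__(self, data: bytes):
        self.gpu = bytearray(data)   # stands in for a mapped HBM region
        self.cpu_backup = None

    def sleep(self) -> None:
        self.cpu_backup = bytes(self.gpu)  # back up to (pinned) host memory
        self.gpu = None                    # unmap / release device memory

    def wake(self) -> None:
        self.gpu = bytearray(self.cpu_backup)  # remap and restore contents
        self.cpu_backup = None

alloc = SleepableAllocation(b"model weights")
alloc.sleep()                    # HBM freed for another tenant
assert alloc.gpu is None
alloc.wake()                     # restored without touching disk
assert bytes(alloc.gpu) == b"model weights"
```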
SimpleCPUOffloadConnector: Manages explicit KV cache offloading to CPU pinned memory. Straightforward two-tier design -- blocks are either on GPU or CPU, with synchronous transfer on cache miss.
SGLang's Approach: HiCache (Hierarchical Cache)
SGLang implements a multi-tier caching system with three or more levels.
Tier 1: GPU (hot). Standard paged KV cache in HBM. Managed by the radix tree prefix cache.
Tier 2: CPU pinned (warm). HostKVCache (sglang/srt/mem_cache/memory_pool_host.py). Pinned memory pool for recently evicted KV entries.
Tier 3: Storage backends (cold). HiCacheStorageConfig and the abstract HiCacheStorage base class (sglang/srt/mem_cache/hicache_storage.py) define a pluggable storage interface. Concrete backends include:
NIXL (hicache_nixl.py): RDMA-based transfer for multi-node KV sharing.
Mooncake: Object store backend for persistent KV across restarts.
LMCache: External caching service integration.
Async transfer via JIT CUDA kernels. transfer_hicache_one_layer() in sglang/jit_kernel/hicache.py (line 73) is a JIT-compiled CUDA kernel for layer-granularity KV transfer. The kernel is parameterized by element_size, unroll factor, and block_quota (how many CUDA blocks to use -- lower values reduce interference with compute kernels). It supports both standard KV layout and MLA (Multi-head Latent Attention) layout via separate run_one_mla / run_all_mla entry points.
Tier migration policies. In HiRadixCache (sglang/srt/mem_cache/hiradix_cache.py):
write_through_threshold (default 1 for write-through, 2 for write-back): A node is backed up to host memory only after hit_count >= write_through_threshold (line 667). This avoids wasting bandwidth on one-shot prefills.
load_back_threshold (default 10): Host-to-GPU transfer only triggers if the number of pages to load exceeds this threshold (line 920), batching small loads to amortize transfer overhead.
batch_exists_v2() in HiCacheStorage checks prefix hits across all tiers simultaneously, using PoolHitPolicy (lines 46-54) to handle different pool semantics (e.g., Mamba states only need trailing pages, not the full prefix).
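The two thresholds compose into a simple admission-and-batching policy. A sketch — only the threshold names mirror SGLang's; the Node class and helper functions are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Node:
    hit_count: int = 0
    on_host: bool = False

def maybe_write_through(node: Node, write_through_threshold: int = 2) -> bool:
    # Admit to the host tier only once the entry has demonstrated reuse.
    if not node.on_host and node.hit_count >= write_through_threshold:
        node.on_host = True
        return True
    return False

def should_load_back(num_pages: int, load_back_threshold: int = 10) -> bool:
    # Batch small host->GPU loads to amortize per-transfer overhead.
    return num_pages > load_back_threshold

one_shot, reused = Node(hit_count=1), Node(hit_count=3)
assert not maybe_write_through(one_shot)  # unique prefill: skip the backup
assert maybe_write_through(reused)        # reused prefix: offload to host
assert not should_load_back(4) and should_load_back(16)
```

The admission side is the same idea as frequency-based cache admission (cf. TinyLFU): spend PCIe bandwidth only on entries with demonstrated reuse value.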
Comparison
Dimension
vLLM
SGLang
Tiers
2 (GPU + CPU)
3+ (GPU + CPU + storage backends)
GPU memory management
Custom cuMemCreate/cuMemMap allocator
Standard PyTorch + JIT transfer kernels
Sleep/wake
Yes (unmap GPU, backup to CPU)
No equivalent
Storage backends
None
NIXL (RDMA), Mooncake, LMCache, HF3FS
Transfer granularity
Block-level sync
Layer-level async JIT kernels
Migration policy
Simple LRU eviction
Hit-count thresholds + batch-size gating
Emerging: CXL memory pooling. CXL 3.0 enables cache-coherent memory sharing across hosts with load/store semantics, potentially replacing RDMA-based KV transfer with simpler shared-memory access. Neither framework supports CXL today, but it represents a natural evolution for multi-node KV caching.
Quiz: Test Your Understanding
Q1 (Calculation). You have an 80 GB H100 GPU running a 70B model with TP=2. Model weights consume 70 GB per GPU, activations take 2 GB, framework overhead is 1 GB. The model has 80 layers, 8 GQA KV heads, head dim 128, FP16 KV. How many tokens of KV cache fit? How does switching to FP8 KV change this?
Available KV budget: $80 - 70 - 2 - 1 = 7$ GB. Per-token KV is $2 \times 80 \times 8 \times 128 \times 2 = 327{,}680$ bytes ($\approx 320$ KB), so $7 \times 1024^3 / 327{,}680 \approx 22{,}900$ tokens fit. Switching to FP8 KV (1 byte per element) halves the per-token footprint to ~160 KB, roughly doubling capacity to ~45,900 tokens.
Q2 (Concept). Why does SGLang's write_through_threshold=2 (write-back mode) make sense for serving workloads with many unique prefills but few cache hits?
If most prefills have unique prompts, their KV entries will be evicted before being reused. Writing them to CPU on first access wastes PCIe bandwidth. By requiring hit_count >= 2, only entries that demonstrate reuse value are backed up. This is analogous to a frequency-based cache admission policy (like TinyLFU) rather than a pure write-through that admits everything.
12.
Observability + Control Plane
First Principles
An inference server that cannot be observed cannot be operated. The core challenge: LLM serving has multiple latency components that interact nonlinearly. A queue depth of 10 might be fine at short context lengths but catastrophic at 32K contexts because each request consumes 10x more KV cache, triggering preemption cascades. Without per-component metrics, operators cannot distinguish between "GPU compute bound" and "memory thrashing."
Key Metrics and What They Reveal
Metric
What it measures
Why it matters
TTFT
Time from request arrival to first output token
Dominated by prefill time + queue wait. User-perceived responsiveness.
KV cache utilization
Fraction of KV cache blocks currently allocated
Memory pressure indicator. At >90%, preemption starts.
Queue depth
Number of waiting requests
Leading indicator of latency spikes.
Batch size (running)
Requests currently in a decode batch
Higher = better GPU utilization, but each addition increases per-iteration latency.
The "healthy but degraded" trap. A server can be live (process running), ready (model loaded), and even pass shallow health checks while delivering terrible latency because its KV cache is thrashing. Deep health checks that perform end-to-end generation are the only way to catch this, at the cost of consuming GPU cycles.
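A deep health check is conceptually just a timed end-to-end generation. A minimal sketch — the `generate` callable and its signature are hypothetical stand-ins for whatever client the probe uses:

```python
import time

def deep_health_check(generate, budget_s: float = 2.0) -> bool:
    """Pass only if a real one-token generation completes within budget."""
    start = time.monotonic()
    try:
        token = generate("ping", max_tokens=1)
    except Exception:
        return False                      # worker crashed or raised
    return token is not None and time.monotonic() - start < budget_s

def healthy(prompt, max_tokens):
    return "ok"

def thrashing(prompt, max_tokens):
    time.sleep(0.1)                       # simulates KV-cache thrashing
    return "ok"

assert deep_health_check(healthy)
assert not deep_health_check(thrashing, budget_s=0.05)
```

Run it at a low frequency (it consumes real GPU cycles) and alongside, not instead of, cheap liveness/readiness probes.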
vLLM
Prometheus metrics (vllm/v1/metrics/prometheus.py): Sets up multiprocess-safe Prometheus export via setup_multiprocess_prometheus() with a temporary directory for the PROMETHEUS_MULTIPROC_DIR multiprocess collector (line 25). Exposes histograms for TTFT, TPOT, and counters for tokens generated, requests completed.
ORCA headers: Load-balancer integration headers that expose real-time capacity information (queue depth, running batch size) in HTTP response headers, allowing L7 load balancers to make routing decisions without polling a separate metrics endpoint.
OpenTelemetry tracing: Distributed tracing support for tracking request lifecycle across API gateway, scheduler, and model workers.
RequestLogger: Per-request structured logging with timing breakdowns.
SGLang
Prometheus middleware (add_prometheus_middleware in sglang/srt/entrypoints/http_server.py): FastAPI middleware that wraps each request with Prometheus timing and counter instrumentation.
Custom tracing (sglang/srt/observability/trace.py): SGLang's own tracing framework for fine-grained scheduler-level event tracking.
Watchdog (sglang/srt/utils/watchdog.py): A factory-pattern monitor created via Watchdog.create() (base class at line 19), which returns a _WatchdogReal instance (line 47). The watchdog runs a background thread (not a subprocess) that detects hung workers. If a worker fails to heartbeat within a configurable timeout, the watchdog kills it and triggers recovery. This is critical for production deployments where NCCL hangs or GPU errors can silently deadlock a worker.
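The heartbeat mechanics reduce to a timestamp comparison. A minimal poll-based sketch (the real watchdog runs the check in a background thread and kills the worker on expiry; this class is illustrative, not SGLang's API):

```python
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        # Called by the worker after every scheduler/forward iteration.
        self.last_beat = time.monotonic()

    def is_stale(self) -> bool:
        # Called periodically by the monitor; True means "kill and recover".
        return time.monotonic() - self.last_beat > self.timeout_s

mon = HeartbeatMonitor(timeout_s=0.05)
mon.beat()
assert not mon.is_stale()    # worker is making progress
time.sleep(0.1)              # worker hangs (e.g., a stuck NCCL all-reduce)
assert mon.is_stale()        # watchdog would now restart the worker
```

The key design point is that the heartbeat is tied to per-iteration progress, not process existence — which is exactly what liveness probes miss.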
PyTorch profiler integration: Built-in hooks for capturing torch.profiler traces for kernel-level performance analysis.
Expert distribution recording: For MoE models, tracks which experts are activated and how load is distributed -- essential for diagnosing MoE load imbalance.
Comparison
Dimension
vLLM
SGLang
Metrics export
Prometheus (multiprocess-safe)
Prometheus (FastAPI middleware)
Load balancer integration
ORCA headers
Not built-in
Distributed tracing
OpenTelemetry
Custom trace framework
Failure detection
Process-level
Watchdog (background thread) with heartbeat
Profiling
External tooling
Built-in PyTorch profiler hooks
MoE-specific
No
Expert distribution recording
Fault tolerance gap. Neither framework handles GPU failure gracefully during active generation. A TP rank failure is unrecoverable -- the entire TP group must restart, losing all in-flight requests. PP stage failures cause request timeouts. In disaggregated mode, a prefill-instance failure loses uncommitted KV cache, requiring re-prefill. At 1000+ GPU scale (where per-GPU MTBF is days), this is a significant production concern. Request-level checkpointing and redundant expert placement for MoE are open problems.
Quiz: Test Your Understanding
Q1 (Concept). A serving cluster shows TTFT p50 = 200ms but p99 = 8s. KV cache utilization averages 60%. What is the most likely cause, and which additional metric would confirm it?
The 40x gap between p50 and p99 TTFT with moderate average KV utilization suggests bursty arrivals causing queue buildup. During bursts, requests queue behind long-context prefills that temporarily spike KV usage and block scheduling. The confirming metric is queue depth over time (not average, but p99 queue depth). A time-series showing queue depth spikes correlating with p99 TTFT spikes would confirm. Additionally, checking the prefill token count distribution would reveal whether a few long-context requests are monopolizing prefill time.
Q2 (Concept). Why is a liveness probe insufficient for production LLM serving? Describe a failure mode that passes liveness but causes complete service degradation.
An NCCL deadlock on one GPU in a TP group: the process is alive (liveness passes), the HTTP server responds to health checks (readiness passes), but any request requiring GPU computation hangs indefinitely because the all-reduce in the attention layer never completes. The only detection methods are (1) deep health checks that run an actual generation request, or (2) a watchdog that monitors per-iteration heartbeats (SGLang's Watchdog approach).
13.
Hardware / Fabric Layer
First Principles
The inference framework must map its logical parallelism strategies onto physical hardware topology. A suboptimal mapping can degrade throughput by an order of magnitude -- for example, running tensor parallelism over PCIe instead of NVLink increases all-reduce latency by ~14x, turning a compute-bound workload into a communication-bound one.
GPU Memory Hierarchy (Key Numbers for H100 SXM)
Level
Capacity
Bandwidth
Latency
Role in LLM Serving
Register file
~256 KB/SM (132 SMs)
N/A (same cycle)
0 cycles
Accumulator registers for matmul, softmax
L1 / SMEM
192 KB/SM (configurable)
~30 TB/s aggregate
~30 cycles
FlashAttention tile storage, reduction scratch
L2 cache
50 MB
~12 TB/s
~200 cycles
KV cache page hot set, weight residency for small layers
HBM3
80 GB
3.35 TB/s
~400 cycles
Model weights, KV cache, activation buffers
PCIe 5.0
Host memory
64 GB/s
~1 us
CPU offload, host KV cache tier
NVLink 4.0
Peer GPU memory
900 GB/s bidir
~1 us
Tensor parallel all-reduce within node
InfiniBand NDR
Remote node
50 GB/s per port (400 Gb/s); DGX H100 has 8x NDR ports = 400 GB/s aggregate per node
~1-5 us
Pipeline parallel activation transfer across nodes
Topology Impact on Parallelism Decisions
Tensor parallelism (TP) requires an all-reduce after every attention and FFN layer. For a hidden dimension $H=8192$ in FP16, the all-reduce message is $H \times 2 = 16$ KB (for a single token in decode). With batching, this becomes $B \times H \times 2$ bytes.
NVLink vs PCIe all-reduce time for a single decode step ($B=64$, $H=8192$, FP16, TP=2): the message is $64 \times 8192 \times 2 = 1$ MiB. At ~50% effective bandwidth, NVLink (450 GB/s) moves it in ~2.3 us per layer; PCIe 5.0 (32 GB/s) takes ~32 us.
Per-layer overhead difference: $\sim29\;\mu s$. Over 80 layers: $2.3$ ms additional per decode step on PCIe.
For prefill with $T=2048$ tokens: message = $2048 \times 8192 \times 2 = 32$ MB. PCIe: 1 ms/layer. NVLink: 0.07 ms/layer. Over 80 layers: 80 ms vs 5.6 ms -- a 14x gap that makes PCIe TP impractical for latency-sensitive prefill.
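These figures fall out of a bandwidth-only model; rounding the effective bandwidths to 450 and 32 GB/s (~50% of peak) reproduces them to within a microsecond. The model deliberately ignores launch/sync latency, which is why it only holds for large messages:

```python
def xfer_us(nbytes: int, gbps: float) -> float:
    return nbytes / (gbps * 1e9) * 1e6

NVLINK, PCIE = 450.0, 32.0          # assumed effective GB/s (~50% of peak)

decode_msg = 64 * 8192 * 2          # B=64, H=8192, FP16: 1 MiB
gap = xfer_us(decode_msg, PCIE) - xfer_us(decode_msg, NVLINK)
print(round(gap))                   # ~30 us/layer; ~2.4 ms over 80 layers

prefill_msg = 2048 * 8192 * 2       # T=2048 tokens: 32 MiB
print(round(xfer_us(prefill_msg, PCIE) / xfer_us(prefill_msg, NVLINK)))  # 14
```

The ratio is just the bandwidth ratio (450/32 ≈ 14), which is why the 14x gap shows up at any message size large enough to be bandwidth-bound.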
Consequence. TP should stay within the NVLink domain (typically 4 or 8 GPUs in one node). Cross-node parallelism should use PP (pipeline parallelism), which transfers only activation tensors at pipeline stage boundaries -- much smaller and less frequent messages.
NVLink vs NVSwitch: point-to-point vs full crossbar. NVLink provides point-to-point connections between GPU pairs. Each NVLink 4.0 link delivers 25 GB/s per direction (50 GB/s bidirectional), and H100 GPUs have 18 links total, but these must be distributed across connected peers. NVSwitch is a fundamentally different component: a non-blocking crossbar switch that connects ALL GPUs in a node simultaneously. DGX H100 uses 4 third-generation NVSwitch chips to create a fully-connected 8-GPU fabric where every GPU has 900 GB/s bidirectional bandwidth to every other GPU concurrently.
This distinction is critical for communication patterns like MoE all-to-all expert dispatch. With NVSwitch, all 8 GPUs can send tokens to all other GPUs simultaneously at full bandwidth (full bisection bandwidth). With only NVLink point-to-point links (no NVSwitch), multi-hop routing would be required for non-adjacent GPU pairs, reducing effective bandwidth and increasing latency. The DGX H100 node topology combines: 8x H100 SXM GPUs + 4x NVSwitch chips (providing the all-to-all GPU fabric) + 8x ConnectX-7 400G InfiniBand NICs (providing 400 GB/s aggregate inter-node bandwidth).
vLLM
PyNVML topology detection: CudaPlatform.is_fully_connected() (vllm/platforms/cuda.py, line 631) uses NVML to query NVLink connectivity between physical device IDs. Returns True only if every pair of GPUs has a direct NVLink connection (1-hop). This gates whether custom all-reduce (optimized for NVLink) or NCCL fallback (handles PCIe + NVSwitch) is used.
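The gating logic reduces to a pairwise reachability check. A framework-agnostic sketch over an adjacency matrix — the NVML querying itself is elided, and the function here is an illustration of the check, not vLLM's code:

```python
from itertools import combinations

def is_fully_connected(nvlink: list[list[bool]]) -> bool:
    """True iff every GPU pair has a direct (1-hop) NVLink connection.
    Input is a symmetric adjacency matrix built from NVML link queries."""
    n = len(nvlink)
    return all(nvlink[i][j] for i, j in combinations(range(n), 2))

# 4 GPUs behind NVSwitch: every pair is directly connected.
full = [[i != j for j in range(4)] for i in range(4)]
# Two NVLink islands of 2 GPUs: cross-island pairs would need multi-hop.
split = [[(i // 2 == j // 2) and i != j for j in range(4)] for i in range(4)]

assert is_fully_connected(full)       # take the custom all-reduce path
assert not is_fully_connected(split)  # fall back to NCCL
```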
Custom all-reduce: custom_all_reduce.py and quick_all_reduce.py implement NVLink-optimized all-reduce that bypasses NCCL overhead for small messages (the decode case where message sizes are KB-scale and NCCL's setup latency dominates).
SGLang
Hardware backends: CUDA, ROCm, NPU (Ascend), MLX (Apple Silicon), MUSA (Moore Threads). Backend-specific graph runners in sglang/srt/hardware_backend/.
NUMA-aware process binding: For multi-socket CPU hosts, SGLang binds worker processes to the NUMA node closest to their assigned GPU, reducing cross-socket memory access latency for CPU-side operations (tokenization, scheduling).
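On Linux the binding itself is a one-liner once the GPU-to-NUMA mapping is known. A sketch with a hypothetical static topology (real code reads the mapping from sysfs or NVML):

```python
import os

# Hypothetical 2-socket host: GPUs 0-3 hang off NUMA node 0, GPUs 4-7 off node 1.
GPU_TO_NUMA = {g: 0 if g < 4 else 1 for g in range(8)}
NUMA_TO_CPUS = {0: set(range(0, 32)), 1: set(range(32, 64))}

def cpus_for_gpu(gpu_id: int) -> set[int]:
    return NUMA_TO_CPUS[GPU_TO_NUMA[gpu_id]]

def bind_worker(gpu_id: int) -> None:
    # Pin this process to the CPUs on the GPU's NUMA node (Linux only).
    os.sched_setaffinity(0, cpus_for_gpu(gpu_id))

assert cpus_for_gpu(2) == set(range(0, 32))
assert cpus_for_gpu(5) == set(range(32, 64))
```

After binding, all CPU-side work for that worker — tokenization, scheduling, KV bookkeeping — touches local memory only.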
Ascend NPU backend (sglang/srt/hardware_backend/npu/): Full graph runner implementations including EAGLE draft support (eagle_draft_npu_graph_runner.py), demonstrating SGLang's hardware abstraction -- the same speculation logic runs on NPU with backend-specific graph execution.
Comparison
Dimension
vLLM
SGLang
GPU platforms
CUDA, ROCm, XPU, TPU, CPU
CUDA, ROCm, NPU, MLX, MUSA
Topology detection
PyNVML is_fully_connected()
Relies on NCCL/torch.distributed
Custom all-reduce
Yes (NVLink-optimized)
Uses NCCL defaults
NUMA binding
No
Yes
NPU support
No (Ascend not supported)
Full Ascend NPU backend
Apple Silicon
No
MLX backend
Quiz: Test Your Understanding
Q1 (Calculation). Compute the all-reduce time for TP=2 with hidden dimension $H=8192$, FP16, batch size $B=1$ (single decode token) on NVLink 4.0 vs PCIe 5.0. Assume effective bandwidth is 50% of peak for both.
At the bandwidth level: the message is $H \times 2 = 16$ KB; at 50% effective bandwidth, NVLink (450 GB/s) takes ~0.04 us and PCIe 5.0 (32 GB/s) takes ~0.5 us. However, at 16 KB both are dominated by latency (kernel launch, synchronization), not bandwidth. Realistic times are ~2-5 us for NVLink and ~5-10 us for PCIe at this message size. The bandwidth advantage of NVLink only manifests for messages > ~100 KB.
Q2 (Concept). A deployment has 2 nodes, each with 4 GPUs connected by NVLink. No NVLink between nodes, only InfiniBand. You need to serve a model requiring 8-way parallelism. Compare TP=8 (all TP) vs TP=4+PP=2 (TP within node, PP across nodes) in terms of communication overhead.
TP=8: Every layer requires an all-reduce across all 8 GPUs. 4 of the 8 pairs cross the InfiniBand link (50 GB/s), not NVLink (900 GB/s). The all-reduce becomes bottlenecked by the slowest link. For a ring all-reduce, the cross-node segment limits throughput to ~50 GB/s for the entire operation.
TP=4 + PP=2: TP all-reduce stays within each node on NVLink (900 GB/s). PP only sends activation tensors at the pipeline boundary -- one message per micro-batch, size $B \times S \times H \times 2$ bytes. For $B=32, S=1, H=8192$: 512 KB over InfiniBand -- negligible. PP adds pipeline bubble overhead ($\sim 1/\text{num\_stages}$ fraction of compute is idle), but the communication savings from keeping TP on NVLink far outweigh this. TP=4+PP=2 is strongly preferred.
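The asymmetry is easy to quantify with a first-order bandwidth model (effective bandwidths assumed at ~50% of peak; the TP=8 term charges every layer's all-reduce at the InfiniBand rate, since the slowest link gates a ring all-reduce):

```python
def xfer_us(nbytes: int, gbps: float) -> float:
    return nbytes / (gbps * 1e9) * 1e6

IB, NVLINK = 25.0, 450.0       # assumed effective GB/s (~50% of peak)
H, layers, B = 8192, 80, 32

allreduce_msg = B * H * 2      # 512 KiB per-layer all-reduce payload

# TP=8: the cross-node segment gates every layer's all-reduce.
tp8 = layers * xfer_us(allreduce_msg, IB)

# TP=4+PP=2: all-reduces stay on NVLink; IB carries one activation
# tensor per micro-batch at the single stage boundary (B=32, S=1).
tp4pp2 = layers * xfer_us(allreduce_msg, NVLINK) + xfer_us(B * 1 * H * 2, IB)

print(round(tp8), round(tp4pp2))   # ~1678 us vs ~114 us per decode step
```

Even before accounting for pipeline bubbles, the hybrid layout wins by more than an order of magnitude on communication time — which is why "TP inside the NVLink domain, PP across nodes" is the standard prescription.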
Q3 (Concept). Why does SGLang's NUMA-aware process binding matter for inference serving but not for training?
In training, the GPU is busy with long-running matmuls for hundreds of milliseconds between CPU interactions, so cross-NUMA latency for the occasional CPU operation is negligible. In serving, the CPU is in the critical path for *every* request: tokenization, scheduling decisions, KV cache management, and HTTP I/O all happen on CPU between GPU forward passes. A decode step can take < 10ms, so a cross-NUMA memory access adding 100-200ns per cache line miss accumulates significantly when processing scheduling logic for hundreds of requests. Binding the scheduler to the same NUMA node as its GPU avoids this overhead.