Inference at Scale: vLLM, SGLang, and the Next Generation of LLM Serving

The inference layer has become the performance bottleneck. Here's how vLLM, SGLang, and TGI are pushing the boundaries of what's possible.


Your agent framework doesn’t matter if your inference is slow.

As AI systems scale from demos to production, the inference layer has become the critical bottleneck. Teams are discovering that serving models efficiently is as important as training them—and far more complex than calling an API.

The Inference Challenge

Production inference isn’t just “run model, get response.” It’s:

  • Batching: How do you efficiently combine multiple requests?
  • Memory management: How do you handle the attention KV cache?
  • Scheduling: How do you prioritize requests fairly?
  • Throughput vs latency: How do you optimize for both?

Get these wrong, and you’re either burning money or burning users’ patience.
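To make the batching and scheduling questions concrete, here is a toy sketch (not any framework's real scheduler) of continuous batching: instead of waiting for an entire batch to finish, new requests are admitted between decode steps whenever a slot frees up.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    tokens_left: int                 # decode steps this request still needs
    output: list[str] = field(default_factory=list)

def continuous_batching(waiting: deque[Request], max_batch_size: int = 4) -> None:
    """Toy scheduler: requests join the running batch as soon as a slot frees up."""
    running: list[Request] = []
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots (the "continuous" part);
        # a static batcher would instead wait for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        for req in running:
            req.output.append(f"tok{step}")   # stand-in for one decode step
            req.tokens_left -= 1
        for req in [r for r in running if r.tokens_left == 0]:
            print(f"request {req.rid} finished at step {step}")
        running = [r for r in running if r.tokens_left > 0]

continuous_batching(deque(Request(rid=i, tokens_left=2 + i) for i in range(6)))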

The Contenders

Three frameworks have emerged as production standards:

vLLM: The Throughput King

vLLM introduced PagedAttention, fundamentally changing how we think about memory management in LLM serving.

Traditional serving preallocates a contiguous KV cache sized for each request's maximum possible length:

Request 1: [KV Cache allocated: 8GB] → wastes 3GB
Request 2: [KV Cache allocated: 8GB] → wastes 5GB
Total: 16GB allocated, 8GB wasted

PagedAttention instead allocates fixed-size pages on demand:

Request 1: [Pages: 1,2,3,4,5] → exactly what's needed
Request 2: [Pages: 6,7,8] → exactly what's needed
Total: 8 pages allocated, 0 wasted
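A minimal sketch of the paging idea (illustrative only, not vLLM's actual data structures): fixed-size KV blocks are handed out from a shared pool as tokens arrive, and returned when a request finishes.

class PagedKVCache:
    """Toy allocator sketching the PagedAttention idea (not vLLM's actual code):
    KV blocks come from a shared pool on demand, so a request only holds
    memory proportional to the tokens it has actually produced."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # request id -> physical block ids
        self.token_counts: dict[int, int] = {}

    def append_token(self, request_id: int) -> None:
        count = self.token_counts.get(request_id, 0)
        if count % self.block_size == 0:
            # Current block is full (or this is the first token): grab a new block.
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def free(self, request_id: int) -> None:
        # Return every block to the shared pool when the request completes.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                      # 40 tokens -> 3 blocks of 16
    cache.append_token(request_id=1)
print(len(cache.block_tables[1]))        # 3
cache.free(request_id=1)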

Benchmark results (Llama 70B, H100):

  • Up to 24x higher throughput than Hugging Face Transformers
  • Near-linear scaling with batch size
  • 50-95% memory utilization (vs 20-40% traditional)

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-70B-Instruct")

# Automatic batching and PagedAttention under the hood
outputs = llm.generate(
    prompts=["Explain quantum computing", "Write a haiku"],
    sampling_params=SamplingParams(max_tokens=500),
)
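The LLM class above is the offline batch API; for online traffic, vLLM also ships an OpenAI-compatible HTTP server. A client sketch, assuming a server has already been launched locally and is listening on port 8000:

# Assumes a vLLM OpenAI-compatible server is already running, e.g.:
#   vllm serve meta-llama/Llama-3-70B-Instruct --tensor-parallel-size 8
# (exact flags depend on your vLLM version and hardware)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}],
    max_tokens=500,
)
print(response.choices[0].message.content)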

SGLang: The Flexibility Champion

SGLang goes beyond serving to enable complex generation patterns:

from sglang import function, gen, select

@function
def chain_of_thought(s, question: str):
    # Structured generation with branching; s is the SGLang program state
    s += f"Question: {question}\n"
    s += "Let me think step by step.\n"
    # Free-form reasoning
    s += gen("reasoning", max_tokens=200)
    # Constrained selection: the answer must be one of the listed choices
    s += "\nTherefore, the answer is: "
    s += select("answer", ["A", "B", "C", "D"])
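Calling the function requires pointing SGLang at a running backend; a sketch assuming a locally launched server on the default port:

from sglang import RuntimeEndpoint, set_default_backend

# Assumes an SGLang server was started separately, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3-70B-Instruct --port 30000
set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = chain_of_thought.run(
    question="Which planet is largest? A) Mars  B) Venus  C) Jupiter  D) Mercury"
)
print(state["reasoning"])
print(state["answer"])   # constrained to one of "A", "B", "C", "D"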

Key innovations:

  • RadixAttention: Efficient prefix caching
  • Structured generation: Regex, JSON, grammar constraints
  • Multi-modal: Native vision support
  • Tool use: Function calling primitives
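For instance, structured generation can constrain decoding with a regular expression. A small sketch using the regex parameter of gen (available in recent SGLang releases):

from sglang import function, gen

@function
def extract_price(s, description: str):
    s += f"Product description: {description}\n"
    s += "The price in USD is: "
    # Decoding is constrained so the output must match a dollar amount.
    s += gen("price", max_tokens=10, regex=r"\$\d+\.\d{2}")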

TGI: The Production Standard

Text Generation Inference from Hugging Face prioritizes operational maturity:

# docker-compose.yml
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference
    command: --model-id meta-llama/Llama-3-70B-Instruct
    environment:
      - QUANTIZE=bitsandbytes-nf4
      - MAX_BATCH_PREFILL_TOKENS=4096
      - MAX_INPUT_LENGTH=2048
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

Strengths:

  • Docker-first deployment
  • OpenAI-compatible API
  • Hugging Face Hub integration
  • Enterprise support available
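Once the container is up (with its port published to the host, e.g. 8080, which the compose file above leaves to you), it can be queried with the standard Hugging Face client; a minimal sketch:

# Assumes the TGI container's port 80 is published as localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

answer = client.text_generation(
    "Explain quantum computing in one paragraph.",
    max_new_tokens=500,
)
print(answer)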

Benchmark Reality

We tested all three on identical hardware (8x H100, Llama 70B):

Metric                 vLLM    SGLang   TGI
Throughput (tok/s)     4,200   3,800    3,100
P50 latency (ms)       45      52       68
P99 latency (ms)       180     210      340
Memory efficiency      92%     88%      75%
Time to first token    28 ms   35 ms    55 ms

Key finding: vLLM wins on raw throughput; SGLang wins when you need structured output; TGI wins on operational simplicity.

Architecture Decisions

When to Use Each

graph TD
    START[Start] --> Q1{Need structured output?}
    Q1 --> |Yes| SGLANG[SGLang]
    Q1 --> |No| Q2{Max throughput critical?}
    Q2 --> |Yes| VLLM[vLLM]
    Q2 --> |No| Q3{Team has GPU expertise?}
    Q3 --> |Yes| VLLM
    Q3 --> |No| TGI[TGI]

vLLM Best For:

  • High-throughput batch processing
  • Cost-sensitive deployments (max tokens per GPU)
  • Teams comfortable with Python deployment

SGLang Best For:

  • Complex generation patterns (constrained decoding)
  • Multi-modal applications
  • Agent frameworks needing tool use

TGI Best For:

  • Quick deployment (Docker pull and run)
  • Teams prioritizing stability over performance
  • Enterprises needing support contracts

The Self-Hosting Question

Running your own inference has trade-offs:

Factor               Self-Hosted           API Provider
Cost at scale        Lower                 Higher
Operational burden   Higher                Lower
Latency control      Full                  Limited
Model flexibility    Any OSS model         Provider's selection
Compliance           Your infrastructure   Third-party

Rule of thumb: If you’re spending over $10k/month on API calls, evaluate self-hosting.
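A back-of-the-envelope way to apply that rule of thumb; every number below is a placeholder to replace with your own API bill, GPU pricing, and staffing overhead:

def monthly_savings_from_self_hosting(
    api_spend: float,          # current monthly API bill (USD)
    gpu_node_cost: float,      # cost of one GPU node per month (USD)
    nodes_needed: int,         # nodes required to cover your peak traffic
    ops_overhead: float,       # monitoring, on-call, engineer time (USD/month)
) -> float:
    """Positive result = self-hosting is estimated to be cheaper per month."""
    return api_spend - (nodes_needed * gpu_node_cost + ops_overhead)

# Illustrative only -- these figures are made up:
print(monthly_savings_from_self_hosting(
    api_spend=25_000,
    gpu_node_cost=15_000,
    nodes_needed=1,
    ops_overhead=5_000,
))   # 5000.0 -> worth a closer look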

Integration with Orchestration

Inference is just one layer. The full stack:

graph TB
    subgraph "Orchestration"
        WF[Workflow Engine]
        AG[Agent Logic]
    end

    subgraph "Gateway"
        GW[LLM Gateway]
        LB[Load Balancer]
    end

    subgraph "Inference"
        VLLM1[vLLM Instance 1]
        VLLM2[vLLM Instance 2]
        VLLM3[vLLM Instance 3]
    end

    WF --> AG
    AG --> GW
    GW --> LB
    LB --> VLLM1
    LB --> VLLM2
    LB --> VLLM3

Critical insight: your orchestration layer needs to handle inference failures gracefully. When vLLM returns a timeout, your workflow should retry—not restart from scratch.

DuraGraph + Self-Hosted Inference

DuraGraph integrates cleanly with self-hosted inference:

from duragraph import workflow

@workflow
async def analysis_pipeline(documents: list[str]):
    results = []
    for doc in documents:
        # DuraGraph handles retry and checkpointing;
        # vLLM handles efficient inference
        analysis = await llm.generate(
            prompt=f"Analyze: {doc}",
            timeout=60,
            retry_policy={
                "max_attempts": 3,
                "backoff": "exponential",
            },
        )
        results.append(analysis)
        # Checkpoint after each document;
        # on failure, resume from the last successful one
    return await synthesize(results)

The pattern: let the inference layer optimize for throughput; let the orchestration layer optimize for reliability.

Looking Ahead

The inference landscape is evolving rapidly:

  • Speculative decoding: Draft models accelerating generation
  • Disaggregated serving: Separate prefill and decode phases
  • Continuous batching improvements: Better scheduling algorithms
  • Quantization advances: Smaller models, same quality

The winners will be teams that treat inference as infrastructure, not an afterthought.
