O narzędziu
vLLM — open-source LLM inference engine (Apache 2.0). V1 (od I 2025, default v0.8.0+) — ground-up rewrite. 1.7x throughput vs V0. Prefix caching: <1% throughput decrease nawet 0% cache hit. FlashAttention 3 integration. Async scheduling z speculative decoding (zero-bubble overlap). Marzec 2026: Model Runner V2 (MRV2) — re-implementation z piecewise CUDA graphs, spec decode rejection sampler, multi-modal embeddings.
Zastosowanie
- •Production LLM serving (high throughput).
- •Multi-tenant LLM API (z prefix caching).
- •Self-hosted alternative dla OpenAI API.
- •Multi-LoRA serving (custom modeli).
- •Distributed inference (multi-GPU/multi-node).
Funkcje dodatkowe
▶V1 Architecture
Ground-up rewrite core engine — scheduler, KV cache manager, worker, sampler, API server. 1.7x throughput vs V0, default od v0.8.0. Najwieksza performance improvement w vLLM history.
▶Model Runner V2 (MRV2, NEW III 2026)
Ground-up re-implementation model runner. Piecewise CUDA graphs dla pipeline parallelism. Spec decode rejection sampler. Multi-modal embeddings dla spec decode. Streaming inputs. EPLB support.
▶FlashAttention 3
Latest FlashAttention z mixing prefill/decode w batch. Skraca memory bandwidth pressure i pozwala na wyzszy throughput dla long-context workloads (100K+ tokens).
▶Chunked Prefill
Chunked prefill bez separate kernel launches — eliminuje pipeline bubbles i znaczaco poprawia throughput dla mieszanego ruchu (prefill-heavy + decode-heavy requests).
▶Async Scheduling + Speculative Decoding
Async scheduling z speculative decoding (zero-bubble overlap). Speculative decoding methods: n-gram, suffix, EAGLE, DFlash. Speeds up token generation 1.5-3x dla suitable workloads.
▶Multi-LoRA Serving
Efficient multi-LoRA support dla dense i MoE layers — serve 100s LoRA adapters z jednego base model. Krytyczne dla production scenariuszy z customized models per customer.
▶Quantization (FP8/INT4/AWQ/GPTQ/GGUF)
Pelne spektrum quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF. Pozwala na deployment 70B+ modeli na mniejszych GPU bez utraty performance.
▶PagedAttention
Efficient management of attention key i value memory — KV cache podzielony na fixed-size blocks (pages). Pozwala na high memory utilization i better batching niz tradycyjne approaches.
▶Continuous Batching
Automatyczne laczenie przychodzacych zapytan w batche — request moze do laczyc sie w trakcie generacji innych. Maksymalizuje GPU utilization dla mixed-length workloads.
▶Distributed Inference
Tensor, pipeline, data, expert, i context parallelism. Pozwala deployowac modele wieksze niz pamiec pojedynczego GPU (np. Llama 70B na 8x A100).
✓ Zalety
Model Runner V2 (MRV2, III 2026)
- •Ground-up re-implementation model runner.
- •Piecewise CUDA graphs dla pipeline parallelism.
- •Spec decode rejection sampler (greedy/logprobs).
- •Multi-modal embeddings dla spec decode.
- •Streaming inputs.
- •EPLB support.
Cennik
- •Open-source (Apache 2.0).
- •$0.
- •GitHub: vllm-project/vllm.
- •NVIDIA enterprise support dostępny.
API i integracje
- •OpenAI-compatible API server.
- •Python library (pip install vllm).
- •REST + Streaming.
- •Hugging Face models (auto-download).
- •Distributed: tensor + pipeline parallelism.
V1 architecture (default od v0.8.0)
- •Ground-up rewrite core engine.
- •Re-architected: scheduler, KV cache manager, worker, sampler, API server.
- •Incremental State Updates — caches request states, transmits diffs.
- •1.7x throughput vs V0.
- •Prefix caching <1% overhead at 0% hit.
Performance features
- •FlashAttention 3 (mixing prefill/decode w batch).
- •Chunked prefill bez separate kernel launches.
- •Async scheduling z speculative decoding (zero-bubble overlap).
- •Multi-LoRA serving.
- •Quantization (AWQ, GPTQ, FP8).
