Strona główna›Narzędzia AI›Inferencja Lokalna›vLLM

vLLM

O narzędziu

vLLM — open-source LLM inference engine (Apache 2.0). V1 (od I 2025, default v0.8.0+) — ground-up rewrite. 1.7x throughput vs V0. Prefix caching: <1% throughput decrease nawet 0% cache hit. FlashAttention 3 integration. Async scheduling z speculative decoding (zero-bubble overlap). Marzec 2026: Model Runner V2 (MRV2) — re-implementation z piecewise CUDA graphs, spec decode rejection sampler, multi-modal embeddings.

📋

Zastosowanie

•Production LLM serving (high throughput).
•Multi-tenant LLM API (z prefix caching).
•Self-hosted alternative dla OpenAI API.
•Multi-LoRA serving (custom modeli).
•Distributed inference (multi-GPU/multi-node).

✨

Funkcje dodatkowe

▶V1 Architecture

Ground-up rewrite core engine — scheduler, KV cache manager, worker, sampler, API server. 1.7x throughput vs V0, default od v0.8.0. Najwieksza performance improvement w vLLM history.

▶Model Runner V2 (MRV2, NEW III 2026)

Ground-up re-implementation model runner. Piecewise CUDA graphs dla pipeline parallelism. Spec decode rejection sampler. Multi-modal embeddings dla spec decode. Streaming inputs. EPLB support.

▶FlashAttention 3

Latest FlashAttention z mixing prefill/decode w batch. Skraca memory bandwidth pressure i pozwala na wyzszy throughput dla long-context workloads (100K+ tokens).

▶Chunked Prefill

Chunked prefill bez separate kernel launches — eliminuje pipeline bubbles i znaczaco poprawia throughput dla mieszanego ruchu (prefill-heavy + decode-heavy requests).

▶Async Scheduling + Speculative Decoding

Async scheduling z speculative decoding (zero-bubble overlap). Speculative decoding methods: n-gram, suffix, EAGLE, DFlash. Speeds up token generation 1.5-3x dla suitable workloads.

▶Multi-LoRA Serving

Efficient multi-LoRA support dla dense i MoE layers — serve 100s LoRA adapters z jednego base model. Krytyczne dla production scenariuszy z customized models per customer.

▶Quantization (FP8/INT4/AWQ/GPTQ/GGUF)

Pelne spektrum quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF. Pozwala na deployment 70B+ modeli na mniejszych GPU bez utraty performance.

▶PagedAttention

Efficient management of attention key i value memory — KV cache podzielony na fixed-size blocks (pages). Pozwala na high memory utilization i better batching niz tradycyjne approaches.

▶Continuous Batching

Automatyczne laczenie przychodzacych zapytan w batche — request moze do laczyc sie w trakcie generacji innych. Maksymalizuje GPU utilization dla mixed-length workloads.

▶Distributed Inference

Tensor, pipeline, data, expert, i context parallelism. Pozwala deployowac modele wieksze niz pamiec pojedynczego GPU (np. Llama 70B na 8x A100).