Strona głównaNarzędzia AIInferencja LokalnavLLM
vLLM

vLLM

0(0)·Inferencja Lokalna
Darmowy (open-source)Odwiedź stronę →

O narzędziu

vLLM — open-source LLM inference engine (Apache 2.0). V1 (od I 2025, default v0.8.0+) — ground-up rewrite. 1.7x throughput vs V0. Prefix caching: <1% throughput decrease nawet 0% cache hit. FlashAttention 3 integration. Async scheduling z speculative decoding (zero-bubble overlap). Marzec 2026: Model Runner V2 (MRV2) — re-implementation z piecewise CUDA graphs, spec decode rejection sampler, multi-modal embeddings.

📋

Zastosowanie

  • Production LLM serving (high throughput).
  • Multi-tenant LLM API (z prefix caching).
  • Self-hosted alternative dla OpenAI API.
  • Multi-LoRA serving (custom modeli).
  • Distributed inference (multi-GPU/multi-node).

Funkcje dodatkowe

V1 Architecture

Ground-up rewrite core engine — scheduler, KV cache manager, worker, sampler, API server. 1.7x throughput vs V0, default od v0.8.0. Najwieksza performance improvement w vLLM history.

Model Runner V2 (MRV2, NEW III 2026)

Ground-up re-implementation model runner. Piecewise CUDA graphs dla pipeline parallelism. Spec decode rejection sampler. Multi-modal embeddings dla spec decode. Streaming inputs. EPLB support.

FlashAttention 3

Latest FlashAttention z mixing prefill/decode w batch. Skraca memory bandwidth pressure i pozwala na wyzszy throughput dla long-context workloads (100K+ tokens).

Chunked Prefill

Chunked prefill bez separate kernel launches — eliminuje pipeline bubbles i znaczaco poprawia throughput dla mieszanego ruchu (prefill-heavy + decode-heavy requests).

Async Scheduling + Speculative Decoding

Async scheduling z speculative decoding (zero-bubble overlap). Speculative decoding methods: n-gram, suffix, EAGLE, DFlash. Speeds up token generation 1.5-3x dla suitable workloads.

Multi-LoRA Serving

Efficient multi-LoRA support dla dense i MoE layers — serve 100s LoRA adapters z jednego base model. Krytyczne dla production scenariuszy z customized models per customer.

Quantization (FP8/INT4/AWQ/GPTQ/GGUF)

Pelne spektrum quantization: FP8, MXFP8/MXFP4, NVFP4, INT8, INT4, GPTQ/AWQ, GGUF. Pozwala na deployment 70B+ modeli na mniejszych GPU bez utraty performance.

PagedAttention

Efficient management of attention key i value memory — KV cache podzielony na fixed-size blocks (pages). Pozwala na high memory utilization i better batching niz tradycyjne approaches.

Continuous Batching

Automatyczne laczenie przychodzacych zapytan w batche — request moze do laczyc sie w trakcie generacji innych. Maksymalizuje GPU utilization dla mixed-length workloads.

Distributed Inference

Tensor, pipeline, data, expert, i context parallelism. Pozwala deployowac modele wieksze niz pamiec pojedynczego GPU (np. Llama 70B na 8x A100).

✓ Zalety

+V1 — 1.7x throughput vs V0
+MRV2 (III 2026) — piecewise CUDA graphs
+Prefix caching <1% overhead at 0% hit
+FlashAttention 3 + chunked prefill
+Async scheduling z speculative decoding
+Open-source Apache 2.0 (production-ready)
🧠

Model Runner V2 (MRV2, III 2026)

  • Ground-up re-implementation model runner.
  • Piecewise CUDA graphs dla pipeline parallelism.
  • Spec decode rejection sampler (greedy/logprobs).
  • Multi-modal embeddings dla spec decode.
  • Streaming inputs.
  • EPLB support.
💰

Cennik

  • Open-source (Apache 2.0).
  • $0.
  • GitHub: vllm-project/vllm.
  • NVIDIA enterprise support dostępny.
🔗

API i integracje

  • OpenAI-compatible API server.
  • Python library (pip install vllm).
  • REST + Streaming.
  • Hugging Face models (auto-download).
  • Distributed: tensor + pipeline parallelism.
📋

V1 architecture (default od v0.8.0)

  • Ground-up rewrite core engine.
  • Re-architected: scheduler, KV cache manager, worker, sampler, API server.
  • Incremental State Updates — caches request states, transmits diffs.
  • 1.7x throughput vs V0.
  • Prefix caching <1% overhead at 0% hit.
📋

Performance features

  • FlashAttention 3 (mixing prefill/decode w batch).
  • Chunked prefill bez separate kernel launches.
  • Async scheduling z speculative decoding (zero-bubble overlap).
  • Multi-LoRA serving.
  • Quantization (AWQ, GPTQ, FP8).

Szczegóły

CenaDarmowy (open-source)
KategoriaInferencja Lokalna
V1 engine1.7x throughputMRV2FlashAttention 3Prefix caching