O narzędziu
llama.cpp — fastest local LLM inference (CPU/GPU). April 2026: 170+ incremental releases z backend-agnostic tensor parallelism, 1-bit quantization, day-one Gemma 4 support, hardware backends AMD CDNA4 i Qualcomm Hexagon. SYCL Q8_0 reorder: 21% → 66% bandwidth = 3x throughput dla Qwen3.5-27B. Q1_0 quantization dla constrained devices. Speculative decoding (small draft model + larger target).
Funkcje 2026
- •Tensor parallelism (build b8738, IV 2026) — true tensor split across multiple GPUs bez vendor lock.
- •Wcześniej: layer-splitting.
- •1-bit quantization (Q1_0) — modele na constrained devices.
- •Day-one Gemma 4 support.
- •AMD CDNA4 i Qualcomm Hexagon backends.
- •AWQ integration w GGUF pipeline.
Funkcje dodatkowe
▶Tensor Parallelism (NEW IV 2026)
True tensor split across multiple GPUs bez vendor lock (build b8738). Wczesniej tylko layer-splitting — nowa technika dramatycznie poprawia performance dla multi-GPU setups.
▶1-bit Quantization (Q1_0, BitNet)
1-bit quantization (Q1_0) dla modeli na constrained devices — phones, embedded systems. Plus BitNet 1.58 bit quantization — trzecia rewolucja quantizationu po INT4/INT8.
▶Day-one Gemma 4 Support
Wsparcie dla najnowszych modeli Google Gemma 4 od dnia release. llama.cpp pozostaje gold standard dla open-source inference, zwykle pierwszy z optimized kernels dla nowych architektur.
▶AMD CDNA4 + Qualcomm Hexagon
Nowe backends 2026: AMD CDNA4 (datacenter GPUs) i Qualcomm Hexagon (mobile NPUs). Najszersze hardware support w branzy open-source LLM inference.
▶AWQ Integration
Activation-aware Weight Quantization integration w GGUF pipeline. Lepsze wyniki niz pure INT4 dla niektorych modeli — szczegolnie dla edge cases (math, code generation).
▶SYCL Q8_0 (Intel Arc)
Bandwidth utilization wzrost z 21% do 66% dla Intel Arc GPUs przez SYCL Q8_0 reorder. Otwiera path dla Intel jako viable alternative dla NVIDIA w LLM inference.
▶3x Throughput dla Qwen3.5-27B
Specific optimization dla Qwen3.5-27B — 3x throughput vs poprzednie wersje. Pokazuje, ze llama.cpp continues aggressive optimization dla popularnych modeli, nie tylko "works out of the box".
▶Speculative Decoding
Small draft model + larger target model — wyzsza throughput dla memory-bandwidth-bound systems. Krytyczne dla CPU-only inference, gdzie memory bandwidth jest bottleneckiem.
▶Hardware Backends
NVIDIA CUDA, Apple Metal, AMD ROCm/CDNA4, Intel SYCL, Qualcomm Hexagon, CPU-only (x86, ARM). Cross-platform: Linux, macOS, Windows, Android, iOS. Najszersze wsparcie hardware w branzy.
▶llama-server REST API
Wbudowany llama-server z REST API (OpenAI-compatible). Pozwala uruchomic LLM jako web service bez koniecznosci zewnetrznych wrapperow jak Ollama lub LM Studio.
✓ Zalety
Cennik
- •Open-source (MIT).
- •$0.
- •GitHub: ggml-org/llama.cpp.
API i integracje
- •Llama.cpp main binary (CLI).
- •Llama-server (REST API, OpenAI-compatible).
- •Python bindings (llama-cpp-python).
- •Used by: Ollama, LM Studio, llamafile, GPT4All.
Quantization formats
- •GGUF: 2-bit do 8-bit integer types.
- •Float32, float16, bfloat16.
- •1.58 bit quantization (BitNet).
- •Q1_0 (1-bit).
- •TQ3_0 (CPU TurboQuant).
- •AWQ + Q4_K_M (community work).
Performance
- •SYCL Q8_0 reorder: 21% → 66% bandwidth utilization (Intel Arc GPUs).
- •3x throughput dla Qwen3.5-27B.
- •Speculative decoding (small draft + larger target) — wyższa throughput dla memory-bandwidth-bound systems.
Hardware backends
- •NVIDIA CUDA, Apple Metal, AMD ROCm, AMD CDNA4 (NEW).
- •Intel SYCL, Qualcomm Hexagon (NEW).
- •CPU-only (x86, ARM).
- •Cross-platform (Linux, macOS, Windows, Android, iOS).
