Strona główna›Narzędzia AI›Inferencja Lokalna›llama.cpp

llama.cpp

O narzędziu

llama.cpp — fastest local LLM inference (CPU/GPU). April 2026: 170+ incremental releases z backend-agnostic tensor parallelism, 1-bit quantization, day-one Gemma 4 support, hardware backends AMD CDNA4 i Qualcomm Hexagon. SYCL Q8_0 reorder: 21% → 66% bandwidth = 3x throughput dla Qwen3.5-27B. Q1_0 quantization dla constrained devices. Speculative decoding (small draft model + larger target).

✨

Funkcje 2026

•Tensor parallelism (build b8738, IV 2026) — true tensor split across multiple GPUs bez vendor lock.
•Wcześniej: layer-splitting.
•1-bit quantization (Q1_0) — modele na constrained devices.
•Day-one Gemma 4 support.
•AMD CDNA4 i Qualcomm Hexagon backends.
•AWQ integration w GGUF pipeline.

✨

Funkcje dodatkowe

▶Tensor Parallelism (NEW IV 2026)

True tensor split across multiple GPUs bez vendor lock (build b8738). Wczesniej tylko layer-splitting — nowa technika dramatycznie poprawia performance dla multi-GPU setups.

▶1-bit Quantization (Q1_0, BitNet)

1-bit quantization (Q1_0) dla modeli na constrained devices — phones, embedded systems. Plus BitNet 1.58 bit quantization — trzecia rewolucja quantizationu po INT4/INT8.

▶Day-one Gemma 4 Support

Wsparcie dla najnowszych modeli Google Gemma 4 od dnia release. llama.cpp pozostaje gold standard dla open-source inference, zwykle pierwszy z optimized kernels dla nowych architektur.

▶AMD CDNA4 + Qualcomm Hexagon

Nowe backends 2026: AMD CDNA4 (datacenter GPUs) i Qualcomm Hexagon (mobile NPUs). Najszersze hardware support w branzy open-source LLM inference.

▶AWQ Integration

Activation-aware Weight Quantization integration w GGUF pipeline. Lepsze wyniki niz pure INT4 dla niektorych modeli — szczegolnie dla edge cases (math, code generation).

▶SYCL Q8_0 (Intel Arc)

Bandwidth utilization wzrost z 21% do 66% dla Intel Arc GPUs przez SYCL Q8_0 reorder. Otwiera path dla Intel jako viable alternative dla NVIDIA w LLM inference.

▶3x Throughput dla Qwen3.5-27B

Specific optimization dla Qwen3.5-27B — 3x throughput vs poprzednie wersje. Pokazuje, ze llama.cpp continues aggressive optimization dla popularnych modeli, nie tylko "works out of the box".

▶Speculative Decoding

Small draft model + larger target model — wyzsza throughput dla memory-bandwidth-bound systems. Krytyczne dla CPU-only inference, gdzie memory bandwidth jest bottleneckiem.

▶Hardware Backends

NVIDIA CUDA, Apple Metal, AMD ROCm/CDNA4, Intel SYCL, Qualcomm Hexagon, CPU-only (x86, ARM). Cross-platform: Linux, macOS, Windows, Android, iOS. Najszersze wsparcie hardware w branzy.

▶llama-server REST API

Wbudowany llama-server z REST API (OpenAI-compatible). Pozwala uruchomic LLM jako web service bez koniecznosci zewnetrznych wrapperow jak Ollama lub LM Studio.

✓ Zalety

+170+ releases w IV 2026 (active development)

+Tensor parallelism (true split across GPUs)

+1-bit quantization (constrained devices)

+AMD CDNA4 + Qualcomm Hexagon backends

+GGUF: 2-bit do 8-bit + bfloat16

+Foundation dla Ollama, LM Studio, llamafile

💰

Cennik

•Open-source (MIT).
•$0.
•GitHub: ggml-org/llama.cpp.

🔗

API i integracje

•Llama.cpp main binary (CLI).
•Llama-server (REST API, OpenAI-compatible).
•Python bindings (llama-cpp-python).
•Used by: Ollama, LM Studio, llamafile, GPT4All.

📋

Quantization formats

•GGUF: 2-bit do 8-bit integer types.
•Float32, float16, bfloat16.
•1.58 bit quantization (BitNet).
•Q1_0 (1-bit).
•TQ3_0 (CPU TurboQuant).
•AWQ + Q4_K_M (community work).

📋

Performance

•SYCL Q8_0 reorder: 21% → 66% bandwidth utilization (Intel Arc GPUs).
•3x throughput dla Qwen3.5-27B.
•Speculative decoding (small draft + larger target) — wyższa throughput dla memory-bandwidth-bound systems.

📋

Hardware backends

•NVIDIA CUDA, Apple Metal, AMD ROCm, AMD CDNA4 (NEW).
•Intel SYCL, Qualcomm Hexagon (NEW).
•CPU-only (x86, ARM).
•Cross-platform (Linux, macOS, Windows, Android, iOS).

Szczegóły

CenaDarmowy (open-source)

KategoriaInferencja Lokalna

170 releases IV 2026Tensor parallelism1-bit quantGGUFAMD CDNA4

Podobne narzędzia