ggml-org/llama.cpp

★ 108,653C++Audience · developerComplexity · 3/5ActiveLicenseSetup · moderate

Mindmap

mindmap
  root((llama.cpp))
    What it does
      Local LLM inference
      Offline text generation
      OpenAI-compatible API
    Hardware support
      Apple Silicon optimized
      NVIDIA CUDA GPUs
      AMD and Intel CPUs
    Key features
      Model quantization
      Hybrid CPU-GPU inference
      Multimodal support
    Use cases
      Private chat servers
      Offline applications
      Edge deployment

mindmap root((llama.cpp)) What it does Local LLM inference Offline text generation OpenAI-compatible API Hardware support Apple Silicon optimized NVIDIA CUDA GPUs AMD and Intel CPUs Key features Model quantization Hybrid CPU-GPU inference Multimodal support Use cases Private chat servers Offline applications Edge deployment

Click or tap to explore — scroll the page freely

Why anyone would actually build this

REASON 1

Run open-source AI chatbots on your own hardware without relying on cloud APIs.

REASON 2

Deploy a private language model server for a team or organization with sensitive data.

REASON 3

Build applications that work offline or on edge devices with limited resources.

REASON 4

Experiment with different language models locally before deciding which to use in production.

Stack

CC++CUDAMetalVulkanARM NEONHIP

Spinning it up

Difficulty · moderate Time to first run · 30min

Requires downloading a model file (potentially gigabytes) and choosing appropriate build flags for your hardware (CPU vs CUDA vs Metal).

Use freely for any purpose, including commercial use, as long as you keep the copyright notice.

Wtf does this do

llama.cpp is a tool for running large language models (LLMs, the kind of AI that powers chat assistants) on your own machine instead of calling a cloud service. The project's stated goal is to enable LLM inference (the step where the model actually produces answers) with minimal setup and strong performance across a wide range of hardware, both locally and in the cloud. Technically, it is a plain C and C++ implementation with no external dependencies. The README highlights that Apple Silicon is treated as a first-class target, with optimizations through ARM NEON, Accelerate and Metal, that x86 chips are accelerated through AVX, AVX2, AVX512 and AMX instruction sets, and that RISC-V chips are also supported. NVIDIA GPUs are supported through custom CUDA kernels, AMD GPUs through HIP, and there are Vulkan and SYCL backends as well. To make models small enough to fit on consumer hardware, the project supports integer quantization at 1.5-bit through 8-bit precision, which shrinks models and speeds them up at some accuracy cost. It can also split work between CPU and GPU so that models larger than your GPU memory can still run, just more slowly. A long list of model families is supported, including LLaMA, Mistral, Mixtral, Gemma, Qwen, Phi, DeepSeek and many more. You would use llama.cpp if you want to run an open-weights chat model on your laptop or server without sending data to an external API, if you want to embed local model inference into your own application through its libllama library, or if you want an OpenAI-compatible API server you control via the bundled llama-server. It can be installed via brew, nix or winget, run from prebuilt binaries, used through Docker, or built from source.

Yoink these prompts

Prompt 1

How do I download and run an open-source language model using llama.cpp on my Mac?

Prompt 2

Show me how to set up llama.cpp as a local API server that's compatible with OpenAI's API format.

Prompt 3

What's the difference between running a model with and without quantization in llama.cpp, and how do I enable it?

Prompt 4

How can I use llama.cpp to run a language model on both my CPU and GPU at the same time?

Prompt 5

What model files work with llama.cpp and where can I find them?

View the repo → Decode another repo

← ggml-org on gitmyhub — every repo by this author, as a profile.

Don't trust strangers blindly. Verify against the repo.