
Complete Guide to Local LLM Quantization: How to Choose GGUF

Compare GGUF quantization types: q4_0 to q8_0, K-quants. Learn performance, quality, memory trade-offs and optimal choices for each use case.

Reviewed & edited by the SINGULISM Editorial Team

Introduction

When running large language models (LLMs) locally, model size and inference speed are the main constraints. On machines with limited GPU memory or CPU throughput, quantization, which stores weights at reduced numeric precision, is essential. GGUF (GPT-Generated Unified Format) is a model format developed by the llama.cpp project that stores and loads quantized weights efficiently. This article surveys the main GGUF quantization types (such as q4_0, q4_K_M, q5_1, and q8_0) and presents selection criteria based on performance and quality data.

Fundamentals of Quantization: Trade-offs between Precision and Performance

LLM weights are typically stored in 16-bit floating-point (FP16) or 32-bit floating-point (FP32). Quantization is the process of converting these values to a lower bit width (e.g., 4-bit, 5-bit, 8-bit), resulting in the following trade-offs:

  • Reduced model size: Lower bit widths result in smaller file sizes and decreased memory usage.
  • Improved inference speed: Relaxation of memory bandwidth constraints improves token generation speed, especially on CPUs and integrated GPUs.
  • Reduced quality: Lower precision reduces the model’s expressiveness, leading to worse perplexity and decreased task accuracy.

Importantly, quantization methods do not all preserve quality equally. Differences in the algorithm (e.g., block size, symmetric versus asymmetric scaling, use of importance weighting) can cause substantial quality differences even at the same bit width.
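To make the precision trade-off concrete, here is a minimal sketch of symmetric block quantization in pure Python. It is a simplified illustration, not the exact ggml storage layout: the function names, the example weights, and the absmax scaling rule are assumptions chosen for clarity.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats (simplified)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0        # one scale per block
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the integer codes."""
    return [scale * c for c in codes]


def max_abs_error(weights, bits):
    """Largest round-trip error for one block at the given bit width."""
    scale, codes = quantize_block(weights, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(weights, restored))


# One block of 32 made-up weights in [-1, 1) (hypothetical data).
block = [((i * 37) % 64 - 32) / 32 for i in range(32)]
err4 = max_abs_error(block, bits=4)
err8 = max_abs_error(block, bits=8)
print(f"4-bit max error: {err4:.4f}, 8-bit max error: {err8:.4f}")
```

The round-trip error is bounded by half the scale, so halving the bit width from 8 to 4 makes the worst-case error per block roughly 18 times larger in this sketch (the ratio of the two scales, 127 to 7).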

Types of GGUF Quantization: Complete List and Features

The set of GGUF quantization types has grown with successive llama.cpp releases. As of 2025, the main types are listed below. Unless otherwise noted, a stated bit width refers to the average bit width (effective bits per weight, counting both the weight bits and the per-block scale bits).
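The effective bits per weight can be worked out directly from the block layout. The sketch below does this for the legacy formats, assuming blocks of 32 weights with one 16-bit scale per block and, for the asymmetric types, one additional 16-bit minimum; the helper name is hypothetical.

```python
def bits_per_weight(block_size, weight_bits, scale_bits, min_bits=0):
    """Average storage cost per weight, including per-block metadata."""
    return (block_size * weight_bits + scale_bits + min_bits) / block_size


# Legacy block formats: 32 weights per block, FP16 scale, optional FP16 minimum.
LEGACY_BPW = {
    "q4_0": bits_per_weight(32, 4, 16),
    "q4_1": bits_per_weight(32, 4, 16, 16),
    "q5_0": bits_per_weight(32, 5, 16),
    "q5_1": bits_per_weight(32, 5, 16, 16),
    "q8_0": bits_per_weight(32, 8, 16),
}

for name, bpw in LEGACY_BPW.items():
    print(f"{name}: {bpw} bits per weight")
```

This is why a "4-bit" q4_0 file is about 12% larger than a pure 4-bit encoding would be: the per-block scale adds half a bit per weight.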

Basic Types (Legacy Quantization)

  • q4_0: 4-bit symmetric quantization. The simplest method. Block size 32, one FP16 scale per block. Implementation is lightweight and advantageous for speed, but quality is inferior to the K-quants described later.
  • q4_1: 4-bit asymmetric quantization. Adds a per-block minimum (offset) to q4_0. Quality is slightly better than q4_0, but the file is somewhat larger and speed is somewhat lower.
  • q5_0: 5-bit symmetric quantization. A step up from q4_0. Good quality, but file size increases.
  • q5_1: 5-bit asymmetric quantization. The 5-bit version of q4_1.
  • q8_0: 8-bit symmetric quantization. Nearly lossless quality. File size is about half that of FP16, but the advantage in inference speed is limited (when memory bandwidth is not the bottleneck).
  • f16: No quantization, FP16 as is. Maximum quality but maximum file size.

K-quants (Improved Quantization)

Since mid-2023, llama.cpp has offered quantization types called K-quants. These group weights into super-blocks of 256 and store the per-block scales (and, for some types, minimums) in quantized form, which lowers quantization error at a similar bit width compared to traditional q4_0. The S, M, and L suffixes stand for small, medium, and large and describe the mix of types applied across tensors, not a group size: the larger mixes keep selected attention and feed-forward tensors at a higher-bit type. An importance matrix (imatrix) is a separate, optional calibration input to the quantizer and is not part of the K-quant format itself.

  • q2_K: 2-bit K-quant. Large quality degradation, almost never recommended for practical use.
  • q3_K_S / q3_K_M / q3_K_L: 3-bit K-quants in small, medium, and large mixes. Quality is low, but chosen when memory constraints are extreme.
  • q4_K_S: 4-bit K-quant, small mix (q4_K for essentially all tensors). Quality is better than q4_0, speed is comparable.
  • q4_K_M: 4-bit K-quant, medium mix (some attention and feed-forward tensors kept at q6_K). Higher quality than q4_K_S, speed not significantly different from q4_0. Widely adopted as the most balanced quantization.
  • q4_K_L: To our knowledge not a built-in llama.cpp type; the name appears in some community uploads, typically for a q4_K_M variant that keeps additional tensors at higher precision. Quality is the highest of the 4-bit options, but not exceptionally so.
  • q5_K_S: 5-bit K-quant, small mix.
  • q5_K_M: 5-bit K-quant, medium mix.
  • q6_K: 6-bit K-quant (no S/M/L variants). Quality is nearly lossless, approaching q8_0.

IQ-quants (Recently Added)

More recent versions of llama.cpp have introduced the IQ series (I-quants). These are more advanced than K-quants, and some maintain practical quality even at very low bit widths (the 2-bit range, e.g., IQ2_XXS); the lowest-bit variants are normally produced with an importance matrix. However, because decoding is more computationally expensive, inference speed tends to be lower than with K-quants, particularly on CPUs. They are not covered in detail in this article, but they are an option for extremely memory-constrained environments.

Performance Comparison: Quality, Speed, Memory

We compare the characteristics of each quantization type using benchmark data. LLM quality is typically evaluated with perplexity on a fixed text corpus (e.g., WikiText-2 or PTB) and with accuracy on multiple tasks (MMLU, GSM8K, HellaSwag). Here we show representative figures.
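For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, and the degradation figures in this article are relative increases over the f16 baseline. A minimal sketch, with hypothetical function names and made-up inputs:

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_degradation(ppl_quant, ppl_base):
    """Percentage increase in perplexity relative to the baseline."""
    return (ppl_quant - ppl_base) / ppl_base * 100.0


# Made-up example: a model that assigns probability 0.5 to every token.
example_logprobs = [math.log(0.5)] * 4
print(perplexity(example_logprobs))

# Made-up example: baseline perplexity 6.00, quantized perplexity 6.06.
print(relative_degradation(6.06, 6.00))
```

Lower perplexity is better, so a positive degradation value means the quantized model predicts the evaluation text slightly worse than the baseline.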

The following figures are compiled from sources widely cited for quantization evaluation of Llama 3 8B (by Meta) and Mistral 7B (llama.cpp official blog, GitHub repository, and multiple community reports). The numbers are not absolute and vary by model and environment, but the relative trends generally hold.

Quantization Type | File Size (approx.) | Perplexity (degradation) | Memory Usage (VRAM, approx.) | Inference Speed (CPU, relative)
f16               | 16.1 GB             | Baseline (0%)            | 8.2 GB (excluding context)   | 1.0x (slow)
q8_0              | 8.5 GB              | +0.1%                    | 4.4 GB                       | 1.2x
q6_K              | 6.6 GB              | +0.3%                    | 3.5 GB                       | 1.4x
q5_K_M            | 5.6 GB              | +0.5%                    | 3.0 GB                       | 1.6x
q5_1              | 5.4 GB              | +0.7%                    | 2.9 GB                       | 1.7x
q4_K_M            | 4.6 GB              | +1.2%                    | 2.5 GB                       | 2.0x
q4_0              | 4.2 GB              | +2.5%                    | 2.3 GB                       | 2.1x
q3_K_L            | 3.5 GB              | +4.0%                    | 2.0 GB                       | 2.5x

  • Perplexity degradation is relative to f16. Even at the same 4-bit width, q4_K_M is significantly better in quality than q4_0.
  • Inference speed is the relative token generation speed on a CPU (AMD Ryzen 9 7950X, DDR5 memory). It depends on memory bandwidth, so it differs on Apple Silicon's unified memory or on GPUs.
  • Memory usage figures are approximate and exclude context (the KV cache), which grows with context length. They are lower than the corresponding file sizes, so read them as a relative indicator only: when the whole model is loaded, budget at least the file size plus the KV cache.
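As a rough cross-check when planning memory, the following sketch estimates a lower bound as the weight file size plus an FP16 KV cache. The architecture values used in the example (32 layers, 8 KV heads, head dimension 128) are assumed here for a Llama 3 8B style model, the function names are hypothetical, and compute buffers and runtime overhead are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the key and value caches: two tensors per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def estimate_memory_gb(file_size_gb, n_layers, n_kv_heads, head_dim, n_ctx):
    """Weights fully resident plus KV cache, in decimal gigabytes."""
    cache = kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx)
    return file_size_gb + cache / 1e9


# Assumed architecture: 32 layers, 8 KV heads, head dimension 128, 4096 tokens.
cache_gb = kv_cache_bytes(32, 8, 128, 4096) / 1e9
total_gb = estimate_memory_gb(4.6, 32, 8, 128, 4096)  # 4.6 GB: q4_K_M file size
print(f"KV cache: {cache_gb:.2f} GB, total lower bound: {total_gb:.2f} GB")
```

Under these assumptions the KV cache adds about half a gigabyte at 4096 tokens, and it scales linearly with context length, so long contexts can matter as much as the choice between adjacent quantization types.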

Notably, q4_K_M offers a significant quality advantage over q4_0 at only a minimal cost in speed. q4_0 is therefore no longer recommended except for older models or cases where compatibility is paramount.

Practical Guide to Choosing a Quantization Type

The optimal quantization type varies depending on the user’s environment and use case. Below are typical scenarios and recommendations.

1. Quality First (Code Generation, Mathematical Reasoning, Translation)

When output accuracy matters most, choose q8_0 or q6_K. For code generation and math tasks (e.g., GSM8K) in particular, the quality degradation from low-bit quantization can be noticeable. Because q8_0 differs only slightly from f16, it is a practical upper limit if you can accept its file size and speed.

2. Balanced Use (General Chat, Summarization, Document Creation)

For many use cases, q4_K_M (or q5_K_M) is the best choice. Despite being 4-bit, q4_K_M degrades perplexity by only about 1%, and its file is roughly 4 to 5 GB for a 7B to 8B model (4.6 GB in the table above), so it runs comfortably on an 8 GB VRAM GPU or a Mac with 16 GB of unified memory. Choosing q5_K_M further improves quality, but the file grows by about 20%.

3. Severe Memory Constraints (VRAM under 8 GB, Laptops with 8 GB RAM)

When memory savings are necessary, consider q4_K_S or q3_K_L. However, q3_K_L has significant quality degradation (perplexity +4%), which may not be practical for some tasks. The recently introduced IQ2_XXS (equivalent to 2-bit) is even smaller, but note that inference speed is slower.

4. Speed First (Real-time Dialogue, Low-Latency Requirements)

To maximize raw token generation speed, q4_0 and q4_K_S are fast. However, consider whether sacrificing quality is worth it. In many real-world environments, memory bandwidth is the bottleneck, and the speed decrease with q4_K_M is negligible.
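When generation is memory-bandwidth-bound, a common back-of-the-envelope estimate is that each generated token requires reading roughly the whole weight file once, so the upper bound on tokens per second is bandwidth divided by model size. The sketch below applies this to the file sizes from the table; the 80 GB/s bandwidth figure is a hypothetical value, not a measurement, and the function name is illustrative.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


BANDWIDTH_GB_S = 80.0  # hypothetical memory bandwidth
FILE_SIZE_GB = {"q8_0": 8.5, "q4_K_M": 4.6, "q4_0": 4.2}  # from the table above

bounds = {name: max_tokens_per_second(BANDWIDTH_GB_S, size)
          for name, size in FILE_SIZE_GB.items()}
for name, tps in bounds.items():
    print(f"{name}: at most {tps:.1f} tokens/s")

speedup_q4_0 = bounds["q4_0"] / bounds["q4_K_M"]
print(f"q4_0 vs q4_K_M upper-bound ratio: {speedup_q4_0:.2f}x")
```

Under this simple model q4_0 can be at most about 10% faster than q4_K_M, since the bound depends only on the ratio of file sizes. Real throughput is lower than the bound because of dequantization and other compute.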

5. Apple Silicon (M1/M2/M3/M4) Environments

Because Apple Silicon's unified memory is shared between GPU and CPU, memory savings directly affect the size of models that can be used. With the Metal backend, q4_K_M is the most recommended. Thanks to Metal-specific kernel optimizations, the speed difference between q4_0 and q4_K_M is reported to be small. q8_0, on the other hand, moves nearly twice as much data per token as a 4-bit type and is correspondingly slower.

6. Points to Note When Choosing a Model File

When downloading a GGUF model, you can identify the quantization type from the file name (e.g., llama-3-8b-instruct.Q4_K_M.gguf). A K-quant label identifies the format and is not a quality guarantee: results also depend on the llama.cpp version used for conversion and on whether an importance matrix was applied. Files with older quantization types (such as q4_0) are still in circulation. Prefer GGUF files created with a recent llama.cpp build, and check the model card for how the file was produced.
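Reading the quantization type off a file name can be automated. The sketch below assumes the common convention in which the type is the last component before the .gguf extension; the regular expression and function name are illustrative, and names that do not follow this convention (for example, sharded files with a numeric suffix) will not match.

```python
import re

# Matches a quant label such as Q4_K_M, q8_0, IQ2_XXS, F16 right before ".gguf".
QUANT_RE = re.compile(
    r"(?:^|[.\-_])((?:IQ|Q)\d+(?:_[A-Z0-9]+)*|BF16|F16|F32)(?=\.gguf$)",
    re.IGNORECASE,
)


def quant_type_from_filename(filename):
    """Return the upper-cased quantization label, or None if none is found."""
    match = QUANT_RE.search(filename)
    return match.group(1).upper() if match else None


for name in [
    "llama-3-8b-instruct.Q4_K_M.gguf",
    "model-q8_0.gguf",
    "model-IQ2_XXS.gguf",
    "model.gguf",
]:
    print(name, "->", quant_type_from_filename(name))
```

The file name is only a convention set by whoever uploaded the file; the authoritative information is in the GGUF metadata itself.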

Overview of the Actual Quantization Procedure

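If you wish to quantize a model yourself, use the quantize tool of llama.cpp (named llama-quantize in recent builds; older builds call it quantize). It takes the input file, the output file, and the quantization type as positional arguments. The following is a basic command example.

# Quantize an FP16 model to q4_K_M
./llama-quantize model-f16.gguf model-q4_K_M.gguf Q4_K_M

Note the following points when quantizing:

  • The input model must already be in GGUF format at high precision (e.g., FP16). To convert from Hugging Face or PyTorch format, use the conversion script shipped with llama.cpp (convert_hf_to_gguf.py in recent versions; older versions used convert.py).
  • The quantization type is given as a positional argument after the file names. An importance matrix can be supplied with --imatrix, but the standard K-quant types are a good default.
  • After quantization, run a test inference with the llama.cpp command-line tool (llama-cli in recent builds; formerly main) to verify that the quality is acceptable.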
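The quantize-then-verify workflow can be scripted. The sketch below only builds the command lines, assuming a recent llama.cpp build in which the tools are named llama-quantize (positional arguments: input, output, type; optional --imatrix) and llama-perplexity (-m for the model, -f for the evaluation text). The binary paths, file names, and helper functions are hypothetical; adjust them to your build.

```python
import shlex


def build_quantize_cmd(src, dst, qtype, imatrix=None, binary="./llama-quantize"):
    """Command list for quantizing src into dst with the given type."""
    cmd = [binary]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]   # optional importance matrix file
    cmd += [src, dst, qtype]
    return cmd


def build_perplexity_cmd(model, text_file, binary="./llama-perplexity"):
    """Command list for measuring perplexity of a model on a text file."""
    return [binary, "-m", model, "-f", text_file]


quantize_cmd = build_quantize_cmd("model-f16.gguf", "model-q4_K_M.gguf", "Q4_K_M")
ppl_cmd = build_perplexity_cmd("model-q4_K_M.gguf", "eval.txt")  # eval.txt is a placeholder

print(shlex.join(quantize_cmd))
print(shlex.join(ppl_cmd))

# To actually run them (requires the llama.cpp binaries and input files):
#   import subprocess
#   subprocess.run(quantize_cmd, check=True)
#   subprocess.run(ppl_cmd, check=True)
```

Running the perplexity command on both the f16 and the quantized file with the same text gives a direct, model-specific measure of the degradation, which is more reliable than generic tables.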

Editorial Opinion

Evaluation Axes for Comparison

The most important evaluation axes when choosing GGUF quantization are “the trade-off between quality degradation and memory savings,” combined with “environmental memory constraints” and “task precision requirements.” The editorial team recommends an incremental approach: take q4_K_M as a practical baseline, move to q5_K_M if memory allows, and step up to q6_K or above if unacceptable quality degradation is observed. In terms of speed, since memory bandwidth is the dominant factor, q4_K_M provides sufficient practical speed.
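The incremental approach can be expressed as a small selection rule: pick the highest-quality type whose file fits the memory budget after reserving some headroom. The sketch below uses the approximate file sizes from the table in this article; the 1 GB headroom for KV cache and runtime buffers is a hypothetical default, and the function name is illustrative.

```python
# (type, approximate file size in GB), ordered from highest to lowest quality.
CANDIDATES = [
    ("q8_0", 8.5),
    ("q6_K", 6.6),
    ("q5_K_M", 5.6),
    ("q4_K_M", 4.6),
    ("q3_K_L", 3.5),
]


def recommend(budget_gb, overhead_gb=1.0):
    """Highest-quality type that fits the budget, or None if nothing fits."""
    for name, size_gb in CANDIDATES:
        if size_gb + overhead_gb <= budget_gb:
            return name
    return None


for budget in (4, 6, 8, 12):
    print(f"{budget} GB budget -> {recommend(budget)}")
```

A result of None signals that a smaller model, or an IQ-series type, is the more realistic option. Treat the output as a starting point and confirm it with a task-specific evaluation.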

Pitfalls in the Field

One cautionary note not covered in official documentation: even for the same quantization type, the degree of quality degradation varies with model architecture and training data. For example, there are reports that Mistral-based models are more robust to low-bit quantization than Llama-based models. Additionally, because the K-quant mixes (K_S, K_M, K_L) assign different precisions to different tensors, quantization error can concentrate in specific layers. Furthermore, backend support for individual quantization types can lag (reports have mentioned q4_1 on Apple Silicon's Metal backend), so check the llama.cpp release notes beforehand.

Future Directions

From 2025 to 2028, quantization technology is expected to push toward even lower bit widths while maintaining accuracy. With improvements in asymmetric and non-linear quantization methods like IQ-quants, more quantization-friendly model architectures, and standardization of INT4/INT8 arithmetic units in hardware, the current default of q4_K_M may give way to IQ2 or IQ3 as the mainstream choice. The editorial team recommends regularly tracking benchmark results and reviewing which quantization method best suits your own tasks.

Frequently Asked Questions

What is the difference between q4_0 and q4_K_M?
Both are 4-bit quantization, but q4_0 is a simple symmetric scheme with one FP16 scale per block of 32 weights, whereas q4_K_M is a K-quant: it uses super-blocks of 256 weights with quantized scales and keeps some tensors at higher precision. In the table above, perplexity degradation is about 1.2% for q4_K_M versus 2.5% for q4_0, so there is almost no reason to choose q4_0 in practice.
How much accuracy is lost with quantization?
For a typical 7B model using q4_K_M, perplexity increases by about 1% relative to f16, which is negligible for most practical tasks. With q8_0 or q6_K, the increase is roughly 0.3% or less. However, for complex reasoning or code generation, the effects of low-bit quantization may become noticeable.
Which quantization type is the fastest for inference?
Theoretically, q2_K, the lowest bit width, is fastest, but its quality degradation is too severe. Among practical choices, q4_0 is the fastest, but the speed difference from q4_K_M is typically less than 5%, depending on the environment. Token generation is largely limited by memory bandwidth, so speed tracks the amount of data read per token rather than the nominal bit width, and dequantization overhead offsets part of the gain at very low bit widths.
What quantization is recommended for Apple Silicon (M-series)?
Considering unified memory constraints, q4_K_M is optimal. Optimization for the Metal backend is advanced, and the inference speed of q4_K_M is almost equal to q4_0. If memory allows, q5_K_M is also an option, but q8_0 has higher memory access load and is disadvantageous in speed.
Source: Singulism
