In previous posts, we explored running LLMs on Ampere-based VMs. Now, we’ll tackle a critical optimization: quantization. This process shrinks models for faster inference and a smaller memory footprint, making it possible to deploy massive, billion-parameter models on resource-constrained hardware, including the free-tier Ampere A1 Compute instances in Oracle Cloud.
This post will first explore the “what, why, and how” of model quantization, from data types to the core methodologies. Specifically, I will distinguish each quantization method by its behavior during quantization building (when and how quantization happens) and during model inference (what gets dequantized, and when computation runs). Then, we’ll walk through a hands-on example to download, convert, and quantize the pfnet/plamo-2-translate model from Hugging Face using the Ampere-optimized llama.cpp container.
The key reason for using this specific container is to leverage Ampere’s specialized quantization schemes, Q4_K_4 and Q8R16. These formats claim a 1.5x inference speed boost on Ampere hardware while keeping model size and perplexity comparable to the standard Q4_K and Q8 formats.
Note: The hands-on section focuses on PTWO (Post-Training Weight-Only) quantization using llama.cpp. For a broader discussion of other formats and engines, see this guide on choosing quants and engines.
What is Quantization?
LLMs are commonly trained on GPUs using full-precision 32-bit or half-precision 16-bit formats for their parameters (weights). The combined size of these weights largely determines the amount of GPU memory needed for inference.
At its core, quantization is the process of reducing the precision of a neural network’s weights and, in some cases, activations. It’s analogous to compressing a high-resolution image to a smaller file size.
Precision vs. Performance
Neural networks are typically trained using FP32 (32-bit single-precision floating point), which offers a wide dynamic range and high precision. For inference, this is often overkill.
- FP16 (16-bit half-precision) is a common first step, halving the model size and speeding up computation on compatible hardware (like GPUs).
- INT8 (8-bit integer) and INT4 (4-bit integer) go further, using far fewer bits. An FP32 model quantized to INT4 can be 1/8th the original size.
Benefits:
- Smaller model size: Reduces storage costs and enables deployment on edge devices.
- Faster inference: This is the key. Smaller data types mean less data to move from DRAM to the CPU caches. For LLMs, inference is often memory-bandwidth bound, not compute-bound. Moving 4-bit data is much faster than moving 16-bit or 32-bit data, especially on CPUs where DRAM is significantly slower than a GPU’s HBM.
- Increased scalability: A lower memory footprint per model allows for more scalable deployment.
Downside:
- Potential accuracy loss: The most significant drawback. Reducing precision is a lossy conversion. The more aggressive the quantization (e.g., 4-bit or 3-bit), the higher the risk of degrading the model’s performance, which we measure using metrics like perplexity.
How it works: The math of scaling
Quantization maps a high-precision floating-point value to a lower-precision integer. For a given tensor \(x\), the quantized value \(q(x)\) is found using a scale factor \(s\) and a zero-point \(z\):
\[q(x) = \text{round}\left( \frac{x - z}{s} \right)\]
Dequantization reverses this, introducing a rounding error \(\epsilon\):
\[\hat{x} = s \cdot q(x) + z \quad (\text{where } \hat{x} = x + \epsilon)\]
A naive, “uniform” approach would apply one scale factor to an entire weight tensor. This fails badly for LLMs, which contain outliers: a few weights with very large absolute values. These outliers would force a large scale \(s\), causing all the “normal” (and far more numerous) small-weight values to be rounded to near-zero, destroying the model’s accuracy.
The solution is block-based quantization. Instead of one scale, the tensor is divided into small blocks (e.g., 32 or 64 weights). Each block gets its own scale factor (and zero-point), isolating the impact of outliers and preserving precision for the majority of weights.
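Here is a minimal NumPy sketch of the idea: symmetric per-block quantization with the zero-point fixed at 0, using the same block size of 32 that llama.cpp’s Q4_0 uses. The function names are mine, for illustration only.

```python
import numpy as np

def quantize_blocks(x, block_size=32, bits=4):
    """Symmetric per-block quantization: one scale per block, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit values
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # guard against all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)   # each value now fits in 4 bits
    return q, scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1).astype(np.float32)

rng = np.random.default_rng(0)
weights = rng.standard_normal(64).astype(np.float32)  # two blocks of 32
q, s = quantize_blocks(weights)
recovered = dequantize_blocks(q, s)
max_err = float(np.abs(weights - recovered).max())    # rounding error, at most s/2 per block
```

Because each block computes its own scale from its own maximum, an outlier in one block inflates only that block’s scale, leaving the rest of the tensor precise.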
In llama.cpp’s GGML backend, this is explicit. For example, the block_q4_0 struct packs 32 weights with a single FP16 scale:

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;            // scale factor (delta)
    uint8_t qs[QK4_0 / 2];    // packed 4-bit quants (16 bytes for 32 values)
} block_q4_0;

This block-wise approach is fundamental to all modern quantization formats, including the K-quants we’ll use.
4 main paradigms: when to quantize
We can broadly categorize quantization methods by when and how they are applied. The main trade-off is between ease of implementation (Post-Training) and maximum accuracy (QAT).
Post-Training Weight-Only Quantization (PTWO):
- What it is: Only the weights are quantized offline. Activations remain in full (or half) precision during inference.
- How it works: At runtime, the weights are dequantized back to FP16/FP32 on-the-fly, just before the matrix multiplication (matmul).
- Pros: Simple, fast, no calibration data needed. This is what llama.cpp uses.
- Cons: No speedup from integer-only matmul, as the compute is still done in FP. The speed gain comes entirely from reduced memory bandwidth.
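To make the PTWO runtime path concrete, here is an illustrative NumPy sketch (not llama.cpp’s actual kernel): the weights travel through memory as 4-bit-range integers plus scales, get dequantized on the fly, and the matmul itself stays in floating point.

```python
import numpy as np

def ptwo_matmul(x, w_q, w_scales):
    """PTWO inference step: dequantize quantized weights on the fly, then FP matmul.
    The arithmetic is floating point; the win is that w_q crossed the memory
    bus at ~4 bits per weight instead of 16 or 32."""
    w = w_q.astype(np.float32) * w_scales           # on-the-fly dequantization
    return x @ w                                    # FP matmul

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
scales = np.abs(w).max(axis=0, keepdims=True) / 7   # per-column scale (illustrative)
w_q = np.round(w / scales).astype(np.int8)          # offline weight-only quantization
x = rng.standard_normal((1, 64)).astype(np.float32)
y = ptwo_matmul(x, w_q, scales)                     # FP activation in, FP result out
rel_err = float(np.abs(y - x @ w).max() / np.abs(x @ w).max())
```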
Post-Training Dynamic Quantization (PTD):
- What it is: Weights are quantized offline. Activations are quantized dynamically (on-the-fly) during inference.
- Pros: Can use integer matmul units (fast!). No calibration data needed.
- Cons: The runtime cost of calculating activation scales can be high.
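By contrast, a PTD-style step can be sketched like this (again an illustrative NumPy model, not a real engine kernel): W was quantized offline, but the activation scale must be computed from each incoming input, which is exactly the runtime overhead noted above, in exchange for an integer matmul.

```python
import numpy as np

def quantize_int8(t):
    """Symmetric INT8 quantization: one scale for the whole tensor."""
    scale = float(np.abs(t).max()) / 127
    q = np.round(t / scale).astype(np.int8)
    return q, scale

def dynamic_int8_matmul(x, w_q, w_scale):
    """PTD inference step: the activation scale is computed at runtime from this
    particular input (the dynamic overhead), letting the matmul run on integers."""
    x_q, x_scale = quantize_int8(x)                     # runtime scale calculation
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)   # int32 accumulator avoids overflow
    return acc.astype(np.float32) * (x_scale * w_scale) # dequantize the output once

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
w_q, w_scale = quantize_int8(w)                         # offline step
x = rng.standard_normal((1, 64)).astype(np.float32)
y_int = dynamic_int8_matmul(x, w_q, w_scale)
rel_err = float(np.abs(y_int - x @ w).max() / np.abs(x @ w).max())
```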
Post-Training Static Quantization (PTS):
- What it is: Weights and activations are quantized.
- How it works: Requires a “calibration” step to run a small, representative dataset through the model to collect the typical range (min/max) of the activations. These static scales are then saved and used during inference.
- Pros: Best performance. Integer matmul with no runtime scaling overhead.
- Cons: Requires a good calibration dataset.
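The calibration step can be sketched as follows. This is illustrative only; real toolchains use observers and histograms rather than a raw running max, and the function names here are mine.

```python
import numpy as np

def calibrate_activation_scale(layer_fn, calibration_inputs, qmax=127):
    """PTS calibration: run representative samples through the layer, record the
    largest activation magnitude seen, and derive one static INT8 scale from it."""
    max_abs = 0.0
    for x in calibration_inputs:
        max_abs = max(max_abs, float(np.abs(layer_fn(x)).max()))
    return max_abs / qmax   # saved with the model and reused at inference time

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8)).astype(np.float32)
layer = lambda x: x @ w     # stand-in for one layer's activation function
samples = [rng.standard_normal((1, 16)).astype(np.float32) for _ in range(100)]
act_scale = calibrate_activation_scale(layer, samples)
```

If the calibration set is not representative, inference inputs will fall outside the recorded range and clip, which is why the quality of the calibration dataset matters.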
Quantization-Aware Training (QAT):
- What it is: Quantization is simulated during the model’s training loop.
- How it works: The model learns to adapt its weights to be “quantization-friendly” from the start.
- Pros: The highest possible accuracy for low-bit (e.g., INT4) quantization.
- Cons: Requires full retraining, which is computationally massive and expensive.
Building the model offline
This table compares the offline “building” phase for each paradigm.
| Step | PTWO (e.g., llama-quantize) | PTD (e.g., PyTorch dynamic) | PTS (e.g., PyTorch static) | QAT (e.g., TensorFlow QAT) |
|---|---|---|---|---|
| 1. Load Model | Load FP weights (W). | Load FP W. | Load FP W. | Load FP model; insert “fake quant” nodes. |
| 2. Weight Quant | Static: Compute per-block scales from W stats. No data needed. | Static: Compute per-tensor/channel scales for W. | Static: Quantize W using stats from calibration data. | During training: Fake quant W; use straight-through estimator for gradients. |
| 3. Activation | None. Handled at runtime (stays FP). | None offline. | Calibration: Run 100-500 samples to find static activation scales. | During training: Fake quant activations to adapt model. |
| 4. Export | Store quantized W + scales (e.g., GGUF). | Store quantized W. | Store quantized W + static activation scales. | Train; then export the final quantized model. |
| Time Cost | Low (seconds-minutes) | Low | Medium (minutes, needs calibration) | High (full retraining) |
| Accuracy | Moderate Loss (outliers unhandled) | Higher Loss (runtime scales suboptimal) | Low Loss (calibrated scales) | Lowest Loss (model adapts to quant) |
Running the model
This table compares what happens at inference time. The key difference is whether the matmul itself happens in integer (INT8) or floating-point (FP) precision.
| Step | PTWO (llama.cpp) | PTD | PTS | QAT |
|---|---|---|---|---|
| Input | FP activation | FP activation | FP activation | FP activation |
| Weight Prep | Dequantize W to FP (on-the-fly) | Keep W as INT8 | Keep W as INT8 | Keep W as INT8 |
| Activation Prep | None (already FP) | Dynamic: Quantize activation to INT8 | Static: Quantize activation to INT8 (fast) | Static: Quantize activation |
| Matmul + Bias | FP Matmul: (FP input \(\cdot\) FP W) | INT Matmul: (INT8 input \(\cdot\) INT8 W) | INT Matmul: (INT8 input \(\cdot\) INT8 W) | INT Matmul |
| Dequant Output | None (already FP) | Dequant output to FP | Dequant output to FP | Dequant output to FP |
| Overhead | Dequant cycles (10-20%) | Dynamic scale calc (+5-10%) | Minimal | Minimal |
| End-to-End Speed | 1x (Gains from memory BW only) | 1.5-3x (Gains from INT compute) | 1.5-3x+ (Fastest) | 1.5-3x+ |
Ampere Optimization
Since we’re using llama.cpp, our focus is squarely on PTWO (Post-Training Weight-Only) quantization.
GGUF and K-Quants
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It’s a single file that bundles model metadata and all the quantized weights.
Within GGUF, there are many quantization schemes, like Q4_0, Q5_K_M, Q8_0, etc. The “K-quants” (_K) are particularly important: they are mixed-precision schemes.
LLM weights are not equally important. Tensors related to attention (e.g., attention.wv) are often more sensitive to quantization than others. To balance quality and size, K-quants use a higher bit-depth for these “important” layers and a lower bit-depth for the rest.
- Q4_K_M: A popular choice. Uses 4-bit for most weights, but upgrades half of the attention.wv and feed_forward.w2 layers to 6-bit (Q6_K) to preserve accuracy.
- Q8_0: A uniform 8-bit quantization. Nearly lossless quality but results in a large file.
When choosing a GGUF quant, we’re trading file size for perplexity (accuracy). The llama.cpp maintainers typically recommend Q4_K_M or Q5_K_M for a good balance.
Approximate Sizes for a 9.5B Model:
- Q2_K.gguf: ~3.5-4GB
- Q4_K_M.gguf: ~5-5.5GB
- Q6_K.gguf: ~7-7.5GB
- Q8_0.gguf: ~9-10GB
- F16.gguf: ~18-19GB (unquantized)
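As a rough sanity check on these numbers, file size is approximately parameter count times effective bits per weight. The bits-per-weight figures below are my approximations (K-quants mix bit-depths, and this ignores metadata and per-block scale overhead):

```python
def gguf_size_gib(n_params, bits_per_weight):
    """Rough GGUF file size estimate: parameters * effective bits per weight, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N = 9.5e9  # plamo-2-translate parameter count
# Effective bits/weight are approximate, not exact format constants
for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{gguf_size_gib(N, bpw):.1f} GiB")
```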
Ampere’s Q4_K_4 and Q8R16
This is where the Ampere-optimized container shines. It introduces two custom GGUF V3 schemes tuned for the ARM (SVE/NEON) instruction set and cache architecture of Ampere CPUs.
Q4_K_4: Tiled 4-Bit K-Quantization
- What it is: This is a standard Q4_K quant, but the data is stored in a “tiled by 4 rows” layout.
- Why it’s faster: This layout is optimized for Ampere’s 64-byte cache lines. During inference, the dequantization kernel can load 4 rows of data at once using SVE vector instructions. This maximizes cache hits and vector-unit parallelism, significantly speeding up the “Dequantize W to FP” step from our runtime table.
- Trade-off: Same size and perplexity as Q4_K_M, but with a 1.5-2x speedup on Ampere hardware.
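The exact kernel layout is internal to the Ampere build, but the general idea of a “tiled by 4 rows” relayout can be illustrated in NumPy: elements from 4 consecutive rows are interleaved so that one sequential, cache-line-sized read feeds all 4 rows of the tile at once. This is illustrative only; the real Q4_K_4 layout packs quantized blocks, not raw floats.

```python
import numpy as np

def tile_by_rows(w, tile=4):
    """Interleave groups of `tile` consecutive rows column by column, so a linear
    walk through memory visits w[0,0], w[1,0], w[2,0], w[3,0], w[0,1], ..."""
    rows, cols = w.shape
    assert rows % tile == 0
    return w.reshape(rows // tile, tile, cols).transpose(0, 2, 1).reshape(rows // tile, cols * tile)

w = np.arange(32).reshape(8, 4)
tiled = tile_by_rows(w)
# The first 4 values in memory now come from 4 different rows of the same column
first_reads = tiled.reshape(-1)[:4]
```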
Q8R16: Mixed-Precision FP8 with Residual FP16
- What it is: A more advanced mixed-precision scheme. Compute-heavy components (like Q/K/V projections) are quantized to FP8 (8-bit float), while more sensitive parts (residuals, embeddings) are left in FP16.
- Why it’s better: FP8, with its built-in exponent bits, handles outliers much better than INT8, resulting in lower perplexity loss. It’s a near-lossless 8-bit format.
- Trade-off: Same size as Q8_0 (~9-10GB for a 9.5B model) but with better-than-Q8_0 accuracy and the 1.5-2x inference speedup from optimized FP8-to-FP16 dequantization kernels.
Hands-On: Quantizing with Ampere Container
Let’s walk through the process of quantizing the pfnet/plamo-2-translate model (a 9.5B parameter Japanese-English translator) using the Ampere container.
Recommended Scheme: We’ll use Q8R16. For a translation task, accuracy is critical. Q8R16 provides the 1.5x speed boost with almost no quality loss. The final ~10GB model will fit easily in a 24GB RAM Ampere A1 instance.
Our Workflow:
- Download the original Hugging Face model.
- Convert the HF model to a full-precision F16.gguf file.
- Use the container’s llama-quantize tool to convert the F16.gguf to our target Q8R16.gguf quantized model file.
Note: We must quantize from the original F16 GGUF. Never re-quantize an already-quantized model (e.g., quantizing a Q8_0 GGUF to Q4_K_M). This will compound the quantization errors and result in a much worse model.
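This warning can be demonstrated numerically. The sketch below uses a simplified symmetric per-block quantizer (not the actual GGUF code paths) to compare quantizing full-precision weights straight to 4-bit against going through an 8-bit intermediate first; the chained path accumulates both rounding errors.

```python
import numpy as np

def q_dq(x, bits, block=32):
    """Quantize to `bits` and immediately dequantize, per block of 32 (symmetric)."""
    qmax = 2 ** (bits - 1) - 1
    b = x.reshape(-1, block)
    s = np.abs(b).max(axis=1, keepdims=True) / qmax
    return (np.round(b / s) * s).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 20).astype(np.float32)

direct  = q_dq(w, bits=4)                  # F16 -> Q4 in one step
chained = q_dq(q_dq(w, bits=8), bits=4)    # F16 -> Q8 -> Q4 (the mistake)

mse_direct  = float(np.mean((w - direct) ** 2))
mse_chained = float(np.mean((w - chained) ** 2))
```

The intermediate 8-bit rounding perturbs every weight before the 4-bit pass, so the chained error can only match or exceed the direct one; real mixed-precision formats compound worse than this toy model suggests.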
Prerequisites
- Docker installed.
- An Ampere A1 VM (or any aarch64 machine).
- Sufficient disk space (~20GB for the original model + ~10GB for the final quant).
- A local llama.cpp checkout on the host, because the Ampere container does not include convert_hf_to_gguf.py.
Step 1: Download HF model
Go to your llama.cpp installation folder. Then download the model from Hugging Face. These repos use git-lfs (Large File Storage) for the multi-GB model weights.
cd /home/opc/llama.cpp/models
# Install git and git-lfs
dnf update && dnf install -y git git-lfs
# Set up LFS
git lfs install
# Clone the model repo. This will download all LFS files.
# This will take time and space (~19GB)
git clone https://huggingface.co/pfnet/plamo-2-translate

After it finishes, verify the download. The model-*.safetensors files should be several GB each.
ls -lh plamo-2-translate

Step 2: Convert HF model to F16 GGUF
Now, we use llama.cpp’s convert_hf_to_gguf.py script to convert the safetensors format into a GGUF file. We specify f16 as the output type to preserve full precision.
# still in the llama.cpp/models directory
python3 ../convert_hf_to_gguf.py plamo-2-translate \
  --outtype f16 \
  --outfile plamo-2-translate-f16.gguf

This will process the shards and create a single plamo-2-translate-f16.gguf file, which will be ~18-19GB.
Step 3: Launch Ampere llama.cpp container
First, pull the image and launch an interactive bash shell. We’ll mount our local llama.cpp/models directory into the container at /models to persist our downloaded and quantized files.
# Pull the latest image
docker pull amperecomputingai/llama.cpp
# Run the container interactively
docker run -it \
-v /home/opc/llama.cpp/models/:/models \
amperecomputingai/llama.cpp:latest \
/bin/bash

We are now inside the container shell at the /llm directory.
Step 4: Quantize F16 to Q8R16
This is the final step. We run the llama-quantize binary (which is compiled with Ampere optimizations) on our F16 GGUF to produce the Q8R16 quantized version.
./llama-quantize /models/plamo-2-translate-f16.gguf \
  /models/plamo-2-translate-q8r16.gguf \
  Q8R16

This will take a few minutes. When it’s done, we’ll have our optimized model!
cd /models/
ls -lh
# We will see:
# ... plamo-2-translate-f16.gguf (18G)
# ... plamo-2-translate-q8r16.gguf (9.8G)

We can now exit the container. The final plamo-2-translate-q8r16.gguf file is available on our host machine in the llama.cpp/models directory, ready for inference.
Step 5: Test quantized model
We can quickly test the new model using the llama-cli tool (which is also in the container).
We can run it from our VM host, or re-enter the container:
docker run -it \
-v /home/opc/llama.cpp/models/:/models \
amperecomputingai/llama.cpp:latest \
/bin/bashInside the container, from the default llm directory:
./llama-cli -m /models/plamo-2-translate-q8r16.gguf \
  --prompt "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=Japanese|English\n銀座でランチをご一緒しましょう。\n<|plamo:op|>output\n" \
  -n 50

We will get a translation similar to “Let’s have lunch in Ginza together.”
Conclusion
We now have a highly optimized, quantized model that leverages Ampere-specific hardware optimizations for a significant inference speedup with minimal quality loss. Our workflow of downloading from HF, converting to F16 GGUF, and then quantizing to an optimized format like Q8R16 is a powerful and repeatable process for deploying LLMs efficiently on Ampere A1 instances.
In my next post, I will quantify the actual performance gain of quantizing from F16 to 4-bit. I’ll then go further to benchmark the Ampere quant schemes like Q4_K_4 against a standard Q4_K_M version to verify the speedup in prompt processing (TTFT) and token generation (TPS). Stay tuned!