In previous posts, we explored running LLMs on Ampere-based VMs. Now, we’ll tackle a critical optimization: quantization. This process shrinks models for faster inference and a smaller memory footprint, making it possible to deploy massive, billion-parameter models on resource-constrained hardware, including the free-tier Ampere A1 Compute instances in Oracle Cloud.
This post will first explore the “what, why, and how” of model quantization, from data types to the core methodologies. Specifically, I will distinguish each quantization method by its behavior during quantization building (when and how quantization happens) and during model inference (what gets dequantized, and when computation runs). Then, we’ll walk through a hands-on example to download, convert, and quantize the pfnet/plamo-2-translate model from Hugging Face using the Ampere-optimized llama.cpp container.
The key reason for using this specific container is to leverage Ampere’s specialized quantization schemes, Q4_K_4 and Q8R16. These formats claim a 1.5x inference speed boost on Ampere hardware while keeping model size and perplexity comparable to the standard Q4_K and Q8 formats.
Note: The hands-on section focuses on PTWO (Post-Training Weight-Only) quantization using llama.cpp. For a broader discussion of other formats and engines, see this guide on choosing quants and engines.
What is Quantization?
LLMs are commonly trained on GPUs using full-precision 32-bit or half-precision 16-bit formats for their parameters (weights). The combined size of these weights largely determines the amount of GPU memory needed for inference.
At its core, quantization is the process of reducing the precision of a neural network’s weights and, in some cases, activations. It’s analogous to compressing a high-resolution image to a smaller file size.
Precision vs. Performance
Neural networks are typically trained using FP32 (32-bit single-precision floating point), which offers a wide dynamic range and high precision. For inference, this is often overkill.
- FP16 (16-bit half-precision) is a common first step, halving the model size and speeding up computation on compatible hardware (like GPUs).
- INT8 (8-bit integer) and INT4 (4-bit integer) go further, using far fewer bits. An FP32 model quantized to INT4 can be 1/8th the original size.
Benefits:
- Smaller model size: Reduces storage costs and enables deployment on edge devices.
- Faster inference: This is the key. Smaller data types mean less data to move from DRAM to the CPU caches. For LLMs, inference is often memory-bandwidth bound, not compute-bound. Moving 4-bit data is much faster than moving 16-bit or 32-bit data, especially on CPUs where DRAM is significantly slower than a GPU’s HBM.
- Increased scalability: A lower memory footprint per model allows for more scalable deployment.
Downside:
- Potential accuracy loss: The most significant drawback. Reducing precision is a lossy conversion. The more aggressive the quantization (e.g., 4-bit or 3-bit), the higher the risk of degrading the model’s performance, which we measure using metrics like perplexity.
How it works: The math of scaling
Quantization maps a high-precision floating-point value to a lower-precision integer. For a given tensor \(x\), the quantized value \(q(x)\) is found using a scale factor \(s\) and a zero-point \(z\):
\[q(x) = \text{round}\left( \frac{x - z}{s} \right)\]
Dequantization reverses this, introducing a rounding error \(\epsilon\):
\[\hat{x} = s \cdot q(x) + z \quad (\text{where } \hat{x} = x + \epsilon)\]
A naive, “uniform” approach would apply one scale factor to an entire weight tensor. This fails badly for LLMs, which contain outliers: a few weights with very large absolute values. These outliers would force a large scale \(s\), causing all the “normal” (and far more numerous) small-weight values to be rounded to near-zero, destroying the model’s accuracy.
The solution is block-based quantization. Instead of one scale, the tensor is divided into small blocks (e.g., 32 or 64 weights). Each block gets its own scale factor (and zero-point), isolating the impact of outliers and preserving precision for the majority of weights.
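Here is a minimal NumPy sketch of the idea: symmetric per-block quantization with the zero-point fixed at 0, using the same block size of 32 that llama.cpp’s Q4_0 uses. The function names are mine, for illustration only.

```python
import numpy as np

def quantize_blocks(x, block_size=32, bits=4):
    """Symmetric per-block quantization: one scale per block, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit values
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                       # guard against all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)   # each value now fits in 4 bits
    return q, scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1).astype(np.float32)

rng = np.random.default_rng(0)
weights = rng.standard_normal(64).astype(np.float32)  # two blocks of 32
q, s = quantize_blocks(weights)
recovered = dequantize_blocks(q, s)
max_err = float(np.abs(weights - recovered).max())    # rounding error, at most s/2 per block
```

Because each block computes its own scale from its own maximum, an outlier in one block inflates only that block’s scale, leaving the rest of the tensor precise.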
In llama.cpp’s GGML backend, this is explicit. For example, the block_q4_0 struct packs 32 weights with a single FP16 scale:

#define QK4_0 32
typedef struct {
    ggml_fp16_t d;            // scale factor (delta)
    uint8_t qs[QK4_0 / 2];    // packed 4-bit quants (16 bytes for 32 values)
} block_q4_0;

This block-wise approach is fundamental to all modern quantization formats, including the K-quants we’ll use.
4 main paradigms: when to quantize
We can broadly categorize quantization methods by when and how they are applied. The main trade-off is between ease of implementation (Post-Training) and maximum accuracy (QAT).
Post-Training Weight-Only Quantization (PTWO):
- What it is: Only the weights are quantized offline. Activations remain in full (or half) precision during inference.
- How it works: At runtime, the weights are dequantized back to FP16/FP32 on-the-fly, just before the matrix multiplication (matmul).
- Pros: Simple, fast, no calibration data needed. This is what llama.cpp uses.
- Cons: No speedup from integer-only matmul, as the compute is still done in FP. The speed gain comes entirely from reduced memory bandwidth.
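To make the PTWO runtime path concrete, here is an illustrative NumPy sketch (not llama.cpp’s actual kernel): the weights travel through memory as 4-bit-range integers plus scales, get dequantized on the fly, and the matmul itself stays in floating point.

```python
import numpy as np

def ptwo_matmul(x, w_q, w_scales):
    """PTWO inference step: dequantize quantized weights on the fly, then FP matmul.
    The arithmetic is floating point; the win is that w_q crossed the memory
    bus at ~4 bits per weight instead of 16 or 32."""
    w = w_q.astype(np.float32) * w_scales           # on-the-fly dequantization
    return x @ w                                    # FP matmul

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
scales = np.abs(w).max(axis=0, keepdims=True) / 7   # per-column scale (illustrative)
w_q = np.round(w / scales).astype(np.int8)          # offline weight-only quantization
x = rng.standard_normal((1, 64)).astype(np.float32)
y = ptwo_matmul(x, w_q, scales)                     # FP activation in, FP result out
rel_err = float(np.abs(y - x @ w).max() / np.abs(x @ w).max())
```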
Post-Training Dynamic Quantization (PTD):
- What it is: Weights are quantized offline. Activations are quantized dynamically (on-the-fly) during inference.
- Pros: Can use integer matmul units (fast!). No calibration data needed.
- Cons: The runtime cost of calculating activation scales can be high.
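By contrast, a PTD-style step can be sketched like this (again an illustrative NumPy model, not a real engine kernel): W was quantized offline, but the activation scale must be computed from each incoming input, which is exactly the runtime overhead noted above, in exchange for an integer matmul.

```python
import numpy as np

def quantize_int8(t):
    """Symmetric INT8 quantization: one scale for the whole tensor."""
    scale = float(np.abs(t).max()) / 127
    q = np.round(t / scale).astype(np.int8)
    return q, scale

def dynamic_int8_matmul(x, w_q, w_scale):
    """PTD inference step: the activation scale is computed at runtime from this
    particular input (the dynamic overhead), letting the matmul run on integers."""
    x_q, x_scale = quantize_int8(x)                     # runtime scale calculation
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)   # int32 accumulator avoids overflow
    return acc.astype(np.float32) * (x_scale * w_scale) # dequantize the output once

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)
w_q, w_scale = quantize_int8(w)                         # offline step
x = rng.standard_normal((1, 64)).astype(np.float32)
y_int = dynamic_int8_matmul(x, w_q, w_scale)
rel_err = float(np.abs(y_int - x @ w).max() / np.abs(x @ w).max())
```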
Post-Training Static Quantization (PTS):
- What it is: Weights and activations are quantized.
- How it works: Requires a “calibration” step to run a small, representative dataset through the model to collect the typical range (min/max) of the activations. These static scales are then saved and used during inference.
- Pros: Best performance. Integer matmul with no runtime scaling overhead.
- Cons: Requires a good calibration dataset.
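The calibration step can be sketched as follows. This is illustrative only; real toolchains use observers and histograms rather than a raw running max, and the function names here are mine.

```python
import numpy as np

def calibrate_activation_scale(layer_fn, calibration_inputs, qmax=127):
    """PTS calibration: run representative samples through the layer, record the
    largest activation magnitude seen, and derive one static INT8 scale from it."""
    max_abs = 0.0
    for x in calibration_inputs:
        max_abs = max(max_abs, float(np.abs(layer_fn(x)).max()))
    return max_abs / qmax   # saved with the model and reused at inference time

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8)).astype(np.float32)
layer = lambda x: x @ w     # stand-in for one layer's activation function
samples = [rng.standard_normal((1, 16)).astype(np.float32) for _ in range(100)]
act_scale = calibrate_activation_scale(layer, samples)
```

If the calibration set is not representative, inference inputs will fall outside the recorded range and clip, which is why the quality of the calibration dataset matters.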
Quantization-Aware Training (QAT):
- What it is: Quantization is simulated during the model’s training loop.
- How it works: The model learns to adapt its weights to be “quantization-friendly” from the start.
- Pros: The highest possible accuracy for low-bit (e.g., INT4) quantization.
- Cons: Requires full retraining, which is computationally massive and expensive.
Building the model offline
This table compares the offline “building” phase for each paradigm.
| Step | PTWO (e.g., llama-quantize) | PTD (e.g., PyTorch dynamic) | PTS (e.g., PyTorch static) | QAT (e.g., TensorFlow QAT) |
|---|---|---|---|---|
| 1. Load Model | Load FP weights (W). | Load FP W. | Load FP W. | Load FP model; insert “fake quant” nodes. |
| 2. Weight Quant | Static: Compute per-block scales from W stats. No data needed. | Static: Compute per-tensor/channel scales for W. | Static: Quantize W using stats from calibration data. | During training: Fake quant W; use straight-through estimator for gradients. |
| 3. Activation | None. Handled at runtime (stays FP). | None offline. | Calibration: Run 100-500 samples to find static activation scales. | During training: Fake quant activations to adapt model. |
| 4. Export | Store quantized W + scales (e.g., GGUF). | Store quantized W. | Store quantized W + static activation scales. | Train; then export the final quantized model. |
| Time Cost | Low (seconds-minutes) | Low | Medium (minutes, needs calibration) | High (full retraining) |
| Accuracy | Moderate Loss (outliers unhandled) | Higher Loss (runtime scales suboptimal) | Low Loss (calibrated scales) | Lowest Loss (model adapts to quant) |
Running the model
This table compares what happens at inference time. The key difference is whether the matmul itself happens in integer (INT8) or floating-point (FP) precision.
| Step | PTWO (llama.cpp) | PTD | PTS | QAT |
|---|---|---|---|---|
| Input | FP activation | FP activation | FP activation | FP activation |
| Weight Prep | Dequantize W to FP (on-the-fly) | Keep W as INT8 | Keep W as INT8 | Keep W as INT8 |
| Activation Prep | None (already FP) | Dynamic: Quantize activation to INT8 | Static: Quantize activation to INT8 (fast) | Static: Quantize activation |
| Matmul + Bias | FP Matmul: (FP input \(\cdot\) FP W) | INT Matmul: (INT8 input \(\cdot\) INT8 W) | INT Matmul: (INT8 input \(\cdot\) INT8 W) | INT Matmul |
| Dequant Output | None (already FP) | Dequant output to FP | Dequant output to FP | Dequant output to FP |
| Overhead | Dequant cycles (10-20%) | Dynamic scale calc (+5-10%) | Minimal | Minimal |
| End-to-End Speed | 1x (Gains from memory BW only) | 1.5-3x (Gains from INT compute) | 1.5-3x+ (Fastest) | 1.5-3x+ |
Ampere Optimization
Since we’re using llama.cpp, our focus is squarely on PTWO (Post-Training Weight-Only) quantization.
GGUF and K-Quants
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp. It’s a single file that bundles model metadata and all the quantized weights.
Within GGUF, there are many quantization schemes, like Q4_0, Q5_K_M, Q8_0, etc. The “K-quants” (_K) are particularly important: they are mixed-precision schemes.
LLM weights are not equally important. Tensors related to attention (e.g., attention.wv) are often more sensitive to quantization than others. To balance quality and size, K-quants use a higher bit-depth for these “important” layers and a lower bit-depth for the rest.
- Q4_K_M: A popular choice. Uses 4-bit for most weights, but upgrades half of the attention.wv and feed_forward.w2 layers to 6-bit (Q6_K) to preserve accuracy.
- Q8_0: A uniform 8-bit quantization. Nearly lossless quality but results in a large file.
When choosing a GGUF quant, we’re trading file size for perplexity (accuracy). The llama.cpp maintainers typically recommend Q4_K_M or Q5_K_M for a good balance.
Approximate Sizes for a 9.5B Model:
- Q2_K.gguf: ~3.5-4GB
- Q4_K_M.gguf: ~5-5.5GB
- Q6_K.gguf: ~7-7.5GB
- Q8_0.gguf: ~9-10GB
- F16.gguf: ~18-19GB (unquantized)
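As a rough sanity check on these numbers, file size is approximately parameter count times effective bits per weight. The bits-per-weight figures below are my approximations (K-quants mix bit-depths, and this ignores metadata and per-block scale overhead):

```python
def gguf_size_gib(n_params, bits_per_weight):
    """Rough GGUF file size estimate: parameters * effective bits per weight, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

N = 9.5e9  # plamo-2-translate parameter count
# Effective bits/weight are approximate, not exact format constants
for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{gguf_size_gib(N, bpw):.1f} GiB")
```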
Ampere’s Q4_K_4 and Q8R16
This is where the Ampere-optimized container shines. It introduces two custom GGUF V3 schemes tuned for the ARM (SVE/NEON) instruction set and cache architecture of Ampere CPUs.
Q4_K_4: Tiled 4-Bit K-Quantization
- What it is: This is a standard Q4_K quant, but the data is stored in a “tiled by 4 rows” layout.
- Why it’s faster: This layout is optimized for Ampere’s 64-byte cache lines. During inference, the dequantization kernel can load 4 rows of data at once using SVE vector instructions. This maximizes cache hits and vector-unit parallelism, significantly speeding up the “Dequantize W to FP” step from our runtime table.
- Trade-off: Same size and perplexity as Q4_K_M, but with a 1.5-2x speedup on Ampere hardware.
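The exact kernel layout is internal to the Ampere build, but the general idea of a “tiled by 4 rows” relayout can be illustrated in NumPy: elements from 4 consecutive rows are interleaved so that one sequential, cache-line-sized read feeds all 4 rows of the tile at once. This is illustrative only; the real Q4_K_4 layout packs quantized blocks, not raw floats.

```python
import numpy as np

def tile_by_rows(w, tile=4):
    """Interleave groups of `tile` consecutive rows column by column, so a linear
    walk through memory visits w[0,0], w[1,0], w[2,0], w[3,0], w[0,1], ..."""
    rows, cols = w.shape
    assert rows % tile == 0
    return w.reshape(rows // tile, tile, cols).transpose(0, 2, 1).reshape(rows // tile, cols * tile)

w = np.arange(32).reshape(8, 4)
tiled = tile_by_rows(w)
# The first 4 values in memory now come from 4 different rows of the same column
first_reads = tiled.reshape(-1)[:4]
```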
Q8R16: Mixed-Precision FP8 with Residual FP16
- What it is: A more advanced mixed-precision scheme. Compute-heavy components (like Q/K/V projections) are quantized to FP8 (8-bit float), while more sensitive parts (residuals, embeddings) are left in FP16.
- Why it’s better: FP8, with its built-in exponent bits, handles outliers much better than INT8, resulting in lower perplexity loss. It’s a near-lossless 8-bit format.
- Trade-off: Same size as Q8_0 (~9-10GB for a 9.5B model) but with better-than-Q8_0 accuracy and the 1.5-2x inference speedup from optimized FP8-to-FP16 dequantization kernels.
Hands-On: Quantizing with Ampere Container
Let’s walk through the process of quantizing the pfnet/plamo-2-translate model (a 9.5B parameter Japanese-English translator) using the Ampere container.
Recommended Scheme: We’ll use Q8R16. For a translation task, accuracy is critical. Q8R16 provides the 1.5x speed boost with almost no quality loss. The final ~10GB model will fit easily in a 24GB RAM Ampere A1 instance.
Our Workflow:
- Download the original Hugging Face model.
- Convert the HF model to a full-precision F16.gguf file.
- Use the container’s llama-quantize tool to convert the F16.gguf to our target Q8R16.gguf quantized model file.
Note: We must quantize from the original F16 GGUF. Never re-quantize an already-quantized model (e.g., quantizing a Q8_0 GGUF to Q4_K_M). This will compound the quantization errors and result in a much worse model.
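This warning can be demonstrated numerically. The sketch below uses a simplified symmetric per-block quantizer (not the actual GGUF code paths) to compare quantizing full-precision weights straight to 4-bit against going through an 8-bit intermediate first; the chained path accumulates both rounding errors.

```python
import numpy as np

def q_dq(x, bits, block=32):
    """Quantize to `bits` and immediately dequantize, per block of 32 (symmetric)."""
    qmax = 2 ** (bits - 1) - 1
    b = x.reshape(-1, block)
    s = np.abs(b).max(axis=1, keepdims=True) / qmax
    return (np.round(b / s) * s).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 20).astype(np.float32)

direct  = q_dq(w, bits=4)                  # F16 -> Q4 in one step
chained = q_dq(q_dq(w, bits=8), bits=4)    # F16 -> Q8 -> Q4 (the mistake)

mse_direct  = float(np.mean((w - direct) ** 2))
mse_chained = float(np.mean((w - chained) ** 2))
```

The intermediate 8-bit rounding perturbs every weight before the 4-bit pass, so the chained error can only match or exceed the direct one; real mixed-precision formats compound worse than this toy model suggests.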
Prerequisites
- Docker installed.
- An Ampere A1 VM (or any aarch64 machine).
- Sufficient disk space (~20GB for the original model + ~10GB for the final quant).
- A local llama.cpp checkout on the host, because the Ampere container does not include convert_hf_to_gguf.py.
Step 1: Download HF model
Go to your llama.cpp installation folder. Then download the model from Hugging Face. These repos use git-lfs (Large File Storage) for the multi-GB model weights.
cd /home/opc/llama.cpp/models
# Install git and git-lfs
dnf update && dnf install -y git git-lfs
# Set up LFS
git lfs install
# Clone the model repo. This will download all LFS files.
# This will take time and space (~19GB)
git clone https://huggingface.co/pfnet/plamo-2-translate

After it finishes, verify the download. The model-*.safetensors files should be several GB each.
ls -lh plamo-2-translate

Step 2: Convert HF model to F16 GGUF
Now, we use llama.cpp’s convert_hf_to_gguf.py script to convert the safetensors format into a GGUF file. We specify f16 as the output type to preserve full precision.
# still in the llama.cpp/models directory
python3 ../convert_hf_to_gguf.py plamo-2-translate \
  --outtype f16 \
  --outfile plamo-2-translate-f16.gguf

This will process the shards and create a single plamo-2-translate-f16.gguf file, which will be ~18-19GB.
Step 3: Launch Ampere llama.cpp container
First, pull the image and launch an interactive bash shell. We’ll mount our local llama.cpp/models directory into the container at /models to persist our downloaded and quantized files.
# Pull the latest image
docker pull amperecomputingai/llama.cpp
# Run the container interactively
docker run -it \
-v /home/opc/llama.cpp/models/:/models \
amperecomputingai/llama.cpp:latest \
/bin/bash

We are now inside the container shell at the /llm directory.
Step 4: Quantize F16 to Q8R16
This is the final step. We run the llama-quantize binary (which is compiled with Ampere optimizations) on our F16 GGUF to produce the Q8R16 quantized version.
./llama-quantize /models/plamo-2-translate-f16.gguf \
  /models/plamo-2-translate-q8r16.gguf \
  Q8R16

This will take a few minutes. When it’s done, we’ll have our optimized model!
cd /models/
ls -lh
# We will see:
# ... plamo-2-translate-f16.gguf (18G)
# ... plamo-2-translate-q8r16.gguf (9.8G)

We can now exit the container. The final plamo-2-translate-q8r16.gguf file is available on our host machine in the llama.cpp/models directory, ready for inference.
Step 5: Test quantized model
We can quickly test the new model using the llama-cli tool (which is also in the container).
We can run it from our VM host, or re-enter the container:
docker run -it \
-v /home/opc/llama.cpp/models/:/models \
amperecomputingai/llama.cpp:latest \
/bin/bashInside the container, from the default llm directory:
./llama-cli -m /models/plamo-2-translate-q8r16.gguf \
  --prompt "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=Japanese|English\n銀座でランチをご一緒しましょう。\n<|plamo:op|>output\n" \
  -n 50

We will get a translation similar to “Let’s have lunch in Ginza together.”
Conclusion
We now have a highly optimized, quantized model that leverages Ampere-specific hardware optimizations for a significant inference speedup with minimal quality loss. Our workflow of downloading from HF, converting to F16 GGUF, and then quantizing to an optimized format like Q8R16 is a powerful and repeatable process for deploying LLMs efficiently on Ampere A1 instances.
In my next post, I will quantify the actual performance gain of quantizing from F16 to 4-bit. I’ll then go further to benchmark the Ampere quant schemes like Q4_K_4 against a standard Q4_K_M version to verify the speedup in prompt processing (TTFT) and token generation (TPS). Stay tuned!