llama.cpp is highly regarded for its performance and customizability when running LLM inference in CPU-only settings. In addition to serving LLMs, it provides a rich toolbox for converting model binaries into its native model file format (GGUF), quantizing single- and half-precision models into one of the supported schemes, and benchmarking performance.
In my previous post, I explored the leading options for running local LLM inference on the Arm-based Ampere A1 shape in Oracle Cloud. In this post, let's run the original llama.cpp and an Ampere optimized container in this environment.
By the end of this post, you will be able to
- build llama.cpp from source yourself, optimized for the underlying hardware (Ampere A1) platform
- obtain pre-quantized models in GGUF files
- run an interactive CLI to chat with any supported model files and configure inference parameters
- serve models to external apps via Python, HTTP APIs and a simple chat UI.
- run a prebuilt, optimized Docker container from the official Ampere repository and serve its optimized models
Let’s start!
Original llama.cpp
According to the official github repo, there are 4 ways to install llama.cpp on your machine:
- Install llama.cpp using brew, nix or winget
- Run with Docker
  - The official doc lists linux/arm64 support for the top 3 images, as well as the -rocm tagged versions.
  - Running the prebuilt Docker image per the documentation, docker run -v /llama-model:/models ghcr.io/ggml-org/llama.cpp:full --all-in-one "/models/" 7B, results in WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested.
  - The rocm images are absent too: the repository hosts only CPU and CUDA images, not ROCm. Those images actually live in AMD's rocm/ namespace on Docker Hub. Either way, they are not an option in our scenario, because we have no AMD GPU in our environment, and the rocm/ images are also built for amd64.
- Download pre-built binaries from the releases page
  - These binaries do not include one for linux/arm64.
- Build from source
  - This gives us the most room to optimize for the local hardware.
As you see, build from source is the way to go!
Build for Ampere A1
Install all the required C++ build tools and dependencies:

sudo dnf install -y git cmake make gcc-c++ libcurl-devel

Clone the llama.cpp source repo with git clone https://github.com/ggml-org/llama.cpp, cd into llama.cpp, then create a build directory by mkdir build.

Configure the build with cmake for Arm64. CMake detects the system compiler, libraries, and hardware capabilities, then determines which backends and features should be included. Its output is a set of Makefiles tailored to the hardware for compilation.

cmake -B build -DGGML_CPU_KLEIDIAI=ON -DLLAMA_CURL=ON

- -B build generates build files in the build directory
- -DGGML_CPU_KLEIDIAI=ON: KleidiAI is a library of optimized microkernels for AI workloads on Arm CPUs. These microkernels enhance performance and can be enabled for use by the CPU backend.
- -DLLAMA_CURL=ON tells the build system to include the code for downloading files, which we will use later to pull models from Hugging Face with the -hf flag.
The output shows hardware-specific optimizations for the Arm Ampere platform, enabling Arm64 CPU instructions:
- DOTPROD: hardware-accelerated dot product operations
- FMA: fused multiply-add for faster floating-point math
- FP16 vector arithmetic: reduced memory use with half-precision floats
Compile llama.cpp. This command automatically executes the build files created in the previous step.
cmake --build build --config Release -j4

- --build build builds in the build directory
- --config Release enables compiler optimizations
- -j4 runs 4 parallel jobs for faster compilation. 4 is chosen because my A1 Flex shape has 4 cores.
The build produces Arm64-optimized binaries in under a minute.
To rebuild llama.cpp with other hardware-specific flags or optimization settings later, you can remove the old build files with rm -rf build, create a new build directory, then run cmake -B and cmake --build again.
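For example, a minimal rebuild sequence, assuming you are in the llama.cpp source directory and want the same flags as above, might look like this:

```bash
# Remove the previous build artifacts
rm -rf build

# Re-run the configure step with the desired flags
cmake -B build -DGGML_CPU_KLEIDIAI=ON -DLLAMA_CURL=ON

# Compile again with 4 parallel jobs
cmake --build build --config Release -j4
```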
After compilation, you’ll find the following key tools in the /build/bin directory:
- llama-cli: main inference executable
- llama-server: HTTP server for model inference
- llama-quantize: tool for quantization to reduce memory usage
- llama-perplexity: tool for measuring perplexity and quality metrics of a model on a given text
- llama-bench: tool for benchmarking inference performance with different parameters such as prompt length, number of generated tokens, and thread count (see the example below)
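As a quick taste of the last tool, once you have a GGUF file on disk (we download models in the next section), a benchmark run might look like the sketch below; the model path is a placeholder:

```bash
# Benchmark prompt processing (512 tokens) and generation (128 tokens) on 4 threads
./build/bin/llama-bench -m ./models/your-model.gguf -p 512 -n 128 -t 4
```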
Get models to play with
We’re now ready to play!
To use llama.cpp, we will need a pre-quantized GGUF model file. GGUF is a file format for storing the information needed to run a model, including model weights, hyperparameters, default generation configuration, and tokenizer. Quantization compresses the original model by reducing weight precision to one of several schemes, with a slight loss in accuracy. This cuts model size, which lowers computational and memory demands and speeds up local CPU-only inference.
Hugging Face is the biggest hub for GGUF model files. You can browse and read model cards to understand each model's strengths, training data and usage. For certain models, you can also find Spaces that host them to test whether they suit your needs. You can also peek into a model to check its metadata (e.g., chat template) and tensor information.
To load GGUF files, we have several choices. We can
- use the -hf flag to load and cache a Hugging Face hosted GGUF model file directly. The model file will be saved to a local cache directory, so the next time you run the same command, it won't need to download it again.
  > You can run this command to find all the models you've loaded so far: find ~/.cache -type f -name "*.gguf". In my Oracle Linux instance, they are stored in the /home/opc/.cache/llama.cpp/ directory.
- download a GGUF model file using wget, curl or the Hugging Face CLI, then use the -m flag to load it from storage.
- convert existing models into GGUF using conversion scripts such as convert_hf_to_gguf.py from the source github repo we cloned. We can then quantize it with ./build/bin/llama-quantize (a rough sketch follows this list).
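As a rough sketch of that third route (the input directory and output file names here are illustrative, and the script's Python requirements from the repo must be installed first):

```bash
# Convert a downloaded Hugging Face model directory into an FP16 GGUF file
python3 convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile ./models/my-model-f16.gguf

# Quantize the FP16 GGUF to the Q4_K_M scheme to shrink it for CPU inference
./build/bin/llama-quantize ./models/my-model-f16.gguf ./models/my-model-Q4_K_M.gguf Q4_K_M
```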
An HF repository can have multiple model files for different quantization types. To see which one is the default, go to the Files tab and click Use this model, then llama.cpp. You will see the default file chosen in the load and run the model section.
Chat with llama-cli
Let's take the first route for simplicity. We'll run a small model to test our installation:
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
This will find the ggml-org/gemma-3-1b-it-GGUF repository on Hugging Face, download the default GGUF model file into cache, and then immediately start a chat session.
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT

To avoid typing the full path to the tools in the future, add the build output directory to your PATH with export PATH=$PATH:$HOME/llama.cpp/build/bin
custom flags
A good interactive command includes flags to manage the conversation flow. llama-cli takes many arguments to build a rich chat experience. For example:
./llama-cli -hf ggml-org/gemma-3-1b-it-GGUF \
--color \
-c 2048 \
-n -1 \
-i \
--reverse-prompt "User:" \
-p "User: " \
--temp 0.5

Once you run this, the model will wait for your input after the User: prompt.
- -hf ggml-org/gemma-3-1b-it-GGUF: Specifies the model to use. llama-cli supports loading model files from a local path (-m), a remote URL (-mu), or the Hugging Face hub (-hf).
- --color: Makes the interactive chat easier to read by color-coding your input and the model's output.
- -c 2048: Sets the context size (how much text the model "remembers") in tokens. A larger context is good for longer conversations but uses more RAM. 2048 is a good starting point for keeping track of the conversation.
- -n -1: Sets the maximum number of tokens to generate. -1 means it will keep generating until it hits the reverse prompt or the context is full.
- -i: Puts you in interactive mode.
- --reverse-prompt "User:": This is key for chatting! It tells the model to stop generating text when it thinks the "User:" should speak next, handing control back to you.
- -p "User: ": Provides the initial prompt to start the chat. (On its own, -p triggers a single non-interactive generation; combined with -i and --reverse-prompt, it seeds the conversation.)
- --temp 0.5: Controls sampling randomness. A value like 0.2 is more deterministic, while 0.8 is more creative. Temperature is just one of the many sampling options llama.cpp supports.
When you’re finished chatting, just press Ctrl+C to exit.
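By contrast, if you just want a single, scriptable response rather than a chat session, you can drop the interactive flags and pass only a prompt. A small sketch reusing the same model; the prompt text is just an example:

```bash
# One-shot generation: print up to 64 tokens and exit
# (-no-cnv disables the default conversation mode in recent llama.cpp builds)
./build/bin/llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -no-cnv \
  -p "Explain the GGUF format in one sentence." \
  -n 64 --temp 0.5
```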
Serve models with llama-server
llama.cpp provides a highly configurable HTTP server. It hosts both an OpenAI-compatible API endpoint and a simple web UI frontend on the same host and port. The vast set of server parameters lets users tweak performance for their unique hardware and usage settings.
cache a new model
Just like the CLI, we can load an HF model directly with the server:

llama-server -hf bartowski/Qwen2.5-3B-GGUF:Q4_K_M --port 8081 --host 0.0.0.0

Notice we're binding the server to all interfaces and exposing it on a specific port (we'll do the same later when running the Docker container).

- --host 0.0.0.0: By default, llama-server binds only to 127.0.0.1 (localhost), which limits access to within the VM. We are now binding the server to listen on all available IPv4 network interfaces, so that it can accept connections from both localhost (on the same machine) and external machines.
- --port 8081: specifies the network port on which llama-server will listen for incoming HTTP requests. It makes the server accessible (on your local machine and/or network) at that address. All communication, including serving the basic web UI and responding to API calls, happens through this port.
We can see the server endpoint in the log:
main: server is listening on http://0.0.0.0:8081 - starting the main loop
srv update_slots: all slots are idle

Since llama-server provides an OpenAI-compatible endpoint, we can use cURL in another terminal to send a prompt request to confirm the server is running and responding to API requests:
curl
curl -s -X POST http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen2.5-3B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about llama.cpp"}
],
"stream": false
}'

Command breakdown:
- curl -X POST: Specifies that you are making a POST request, which is required for sending data to the server.
- http://localhost:8081/v1/chat/completions: The full URL of the API endpoint.
- -H "Content-Type: application/json": This header tells the server that the data you are sending is in JSON format.
- -d '{ ... }': This flag provides the JSON data to be sent in the request body. The JSON object contains a messages array, formatted according to the OpenAI chat completion API.
Our cURL command returns a valid JSON response with the model’s answer, showing that the llama-server is working correctly.
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Llama.cpp, Proudly written,
Code in hand ."}}],"created":1758243953,"model":"Qwen2.5-3B","system_fingerprint":"b6517-69ffd891","object":"chat.completion","usage":{"completion_tokens":8,"prompt_tokens":26,"total_tokens":34},"id":"chatcmpl-SVw8VXmeZun342tdL6B1r0a5bhjK6DWY","timings":{"cache_n":0,"prompt_n":26,"prompt_ms":1479.27,"prompt_per_token_ms":56.894999999999996,"prompt_per_second":17.576236927673786,"predicted_n":8,"predicted_ms":609.225,"predicted_per_token_ms":76.153125,"predicted_per_second":13.131437482046863}}

At the same time, we can also see on the server side that it is processing the prompt.
srv params_from_: Chat format: Content-only
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 26
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 26, n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 26, n_tokens = 26
slot release: id 0 | task 0 | stop processing: n_past = 33, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1479.27 ms / 26 tokens ( 56.89 ms per token, 17.58 tokens per second)
eval time = 609.23 ms / 8 tokens ( 76.15 ms per token, 13.13 tokens per second)
total time = 2088.49 ms / 34 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200

Another handy feature is the performance metrics llama.cpp prints, covering the platform optimizations in use, CPU buffer sizes, latency and throughput:
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | KLEIDIAI = 1 | REPACK = 1 |
llama_perf_sampler_print: sampling time = 0.27 ms / 1 runs ( 0.27 ms per token, 3690.04 tokens per second)
llama_perf_context_print: load time = 645.73 ms
llama_perf_context_print: prompt eval time = 343.21 ms / 12 tokens ( 28.60 ms per token, 34.96 tokens per second)
llama_perf_context_print: eval time = 30703.19 ms / 738 runs ( 41.60 ms per token, 24.04 tokens per second)
llama_perf_context_print: total time = 236849.43 ms / 750 tokens
llama_perf_context_print: graphs reused = 735

OpenAI Python binding
For programmatic interaction, we can use any Python client library that speaks the OpenAI API, such as the openai package used below.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8081/v1",
api_key="llama" # Dummy key for local access
)
completion = client.completions.create(
model="Qwen2.5-3B",
prompt="Write a haiku about llama.cpp"
)
print(completion.choices[0].text)

Web UI

Lastly, we can hit http://localhost:8081 in our browser and access the default web UI. When you run llama-server without specifying a dedicated frontend, it automatically serves a simple, lightweight chat web UI on the same port as its API. Beneath the surface, this web interface communicates with llama-server's REST API via the OpenAI-compatible /v1/chat/completions endpoint. It comes with all the necessary features, such as

- chat history
- generation and sampler settings such as temperature, top_k, top_p, max tokens, penalty
- reasoning settings
serve a local model
A more common practice to serve a model with llama.cpp is to run llama-server with the -m flag to load a model from a local path on your machine. In this section, we will do exactly that.
Let's download an HF model, bartowski/Qwen2.5-3B-GGUF:Q4_K_M, locally so we can serve it.
First we create a local directory to store our models:

mkdir models

We then download models into this folder. The recommended way to download files from Hugging Face is the official huggingface-cli command-line tool.
- First, install the necessary Python library: pip install --upgrade huggingface_hub
- Download the model with the repository ID and the specific filename: huggingface-cli download <repo_id> <filename>
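For instance, for the model used in this post, the command might look like the sketch below; the --local-dir flag tells the CLI where to place the file, and the path assumes the models directory we just created:

```bash
# Download one specific quantized file from the repository into ./models
huggingface-cli download bartowski/Qwen2.5-3B-GGUF Qwen2.5-3B-Q4_K_M.gguf --local-dir ./models
```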
Or we can simply use wget with the exact file location. Go to the model page, click Files, pick the quantized model of your choice, then click Copy download link.
wget https://huggingface.co/bartowski/Qwen2.5-3B-GGUF/resolve/main/Qwen2.5-3B-Q4_K_M.gguf -O ./models/Qwen2.5-3B-Q4_K_M.GGUF
run server
Now we can start the server with this model
llama-server -m ./models/Qwen2.5-3B-Q4_K_M.GGUF \
--host 0.0.0.0 \
--port 8081 \
--ctx-size 4096

- --ctx-size 4096 sets the context window size (the number of tokens the model can "remember" in a single conversation turn).
We can then use the same cURL command and Python code from the previous section to interact with it.
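For a quick check that the locally loaded model is the one being served, you can also query the server's model listing endpoint (the same endpoint we will test through nginx later):

```bash
# Lists the model(s) the running llama-server instance is serving
curl -s http://localhost:8081/v1/models
```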
access server externally
To access the llama-server endpoint from an external machine at http://VM-public-IP:8081 using cURL, Python, or the simple web UI, we have three options:
- configure our VM to enable both OCI Security List and host OS firewall, to accept traffic for the server at the port
- use nginx to reverse proxy so there is no need to expose additional ports
- use a Cloudflare Tunnel to enable a secure outbound connection to a Cloudflare managed domain that proxies the traffic.
For option 1, enable the following:

- To reach the VM instance, set up the OCI subnet's ingress rule to allow inbound traffic to the external interface of the VM on port 8081.
- To reach the server inside the VM, set up the VM instance's local firewall (firewalld in Oracle Linux) to allow incoming traffic to reach llama-server at 8081, and ensure the rule persists after a reboot:
# Open port 8081 for TCP traffic permanently
$ sudo firewall-cmd --zone=public --add-port=8081/tcp --permanent
# Reload the firewall to apply the new rule
$ sudo firewall-cmd --reload
success

After configuring the firewall, verify the server is listening with sudo netstat -tuln | grep 8081. You should see a line showing 0.0.0.0:8081 in the Local Address column with a state of LISTEN.
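The check and its expected output would look roughly like this (the exact columns vary slightly between netstat versions):

```bash
sudo netstat -tuln | grep 8081
# Expected output (illustrative):
# tcp   0   0 0.0.0.0:8081   0.0.0.0:*   LISTEN
```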
For option 2:

1. Create a new DNS record pointing a custom domain at our llama-server VM.
2. Create a new config file /etc/nginx/conf.d/llama.conf with this server block, replacing custom-domain.com with the real custom domain used in step 1:

server {
    listen 80;
    server_name custom-domain.com;
    location / {
        proxy_pass http://127.0.0.1:8081/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}

3. Test the config and reload nginx:

sudo nginx -t
sudo systemctl reload nginx

4. Test externally from another machine with curl http://custom-domain.com/v1/models, and nginx will forward the request to the llama-server endpoint.
Ampere optimized llama.cpp container
Since Ampere has an optimized build of llama.cpp in a container, along with new quantization methods (Q4_K_4 and Q8R16, delivering 1.5-2x faster inference), let's take a look!
Make sure you have Docker installed on your system.
Ampere recommends using models in its custom quantization formats for best performance. Let's download a model to our previously created folder ./llama.cpp/models/.

Let's try AmpereComputing/qwen-3-4b-gguf and pick a specific quantized version from the Files tab. For this example, we'll use Qwen3-4B-Q8R16.gguf.

wget https://huggingface.co/AmpereComputing/qwen-3-4b-gguf/resolve/main/Qwen3-4B-Q8R16.gguf -O ./llama.cpp/models/Qwen3-4B-Q8R16.gguf

Note: You must use the Ampere optimized container to run these models. Attempting to run one with the original llama.cpp fails with the error tensor 'token_embd.weight' has invalid ggml type 64 (NONE) gguf_init_from_file_impl: failed to read tensor info, which signals an incompatibility between the GGUF model file and the original llama.cpp: the quantization formats introduced by Ampere use new ggml_type codes for tensors that the original llama.cpp does not understand.

Now we are ready to run the Ampere llama.cpp container.
By default, the Ampere container mirrors the original llama.cpp container to run llama-server as an entrypoint. Thus we can rely on this pre-configured setting and just pass the model path as an argument.
This command will start a container named llama, mount our models directory to /models inside the container, start up the server with the arguments we provide, and attach us to the server's log output.
docker run -it --rm \
-v /home/opc/llama.cpp/models/:/models \
--name llama \
-p 8081:8081 \
amperecomputingai/llama.cpp:latest \
-m /models/Qwen3-4B-Q8R16.gguf --host 0.0.0.0 --port 8081 --ctx-size 4096

Command breakdown
- -it: Keeps the container's standard input open and allocates a pseudo-TTY, allowing you to see the server's log output.
- --rm: Automatically removes the container when it exits, keeping your system clean.
- -v /home/opc/llama.cpp/models/:/models: Maps the model directory created previously to a mount point in the container:
  - /home/opc/llama.cpp/models/: The path to the directory on our host machine where our model is stored. Docker requires absolute paths for volume mounts; you can get the full, absolute path of a directory with pwd.
  - /models: The path inside the container where llama-server will find the model.
- -p 8081:8081: Maps port 8081 on the host machine to port 8081 inside the container, where llama-server listens. This allows you to access the server from outside the container.
- amperecomputingai/llama.cpp:latest: The official Docker image to pull and run.
  > Make sure you pull only this tagged version and not the 3.2.1-ampereone image, which is incompatible with the A1 Flex shape.
- The remaining arguments are passed directly to the image's entrypoint, which defaults to llama-server, and mirror the flags we set in the previous section:
  - -m /models/Qwen3-4B-Q8R16.gguf: loads the model from /models/, the path inside the container that we mapped with the -v flag. Qwen3-4B-Q8R16.gguf is the file we downloaded earlier.
  - --host 0.0.0.0 --port 8081 --ctx-size 4096: --host 0.0.0.0 ensures the server listens on all network interfaces, so it is reachable via the mapped port; --port 8081 sets the port the server listens on, which must match the second port parameter in -p above; --ctx-size 4096 sets the context size.
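If the server does not seem to respond, it can help to confirm from a second terminal that the container is up and to follow its logs; these are standard Docker commands using the container name we chose above:

```bash
# Confirm the container is running and check the port mapping
docker ps --filter name=llama

# Follow llama-server's startup and request logs
docker logs -f llama
```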
interact with server
We can use the same cURL command and Python in the previous section to interact with it.
If you think the default chat UI is too minimal, you can switch to Open WebUI, a frontend designed for managing and interacting with various LLM backends (like llama.cpp, Ollama, etc.). It includes multi-model management, persistent chat history, user management, RAG (Retrieval Augmented Generation) capabilities, function/tool calling, an admin panel, and a much better UI/UX.
Run a separate Open WebUI Docker container and configure it to connect to your llama-server instance. Open WebUI allows users to manage models, engage in conversations, and generally interact with the LLMs running on the llama-server backend.

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  --name open-webui --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

Configure the connection: Once the container is running, open your web browser and navigate to http://localhost:3000. In the settings or "Connections" section, add a new OpenAI-compatible connection.
- URL: http://host.docker.internal:8081/v1/ (host.docker.internal is a special Docker DNS name that resolves to the host machine's IP, allowing the UI container to reach the server container.)
- API Key: Leave this field empty.
Start chatting: After saving the connection, you should be able to select your model and interact with it through the chat interface.
running shell in container
You can also override the image’s default entrypoint by running a shell instead:
docker run --privileged=true --name llama2 --entrypoint /bin/bash -it -v /home/opc/llama.cpp/models/:/models amperecomputingai/llama.cpp:latest

The shell will launch into the /llm folder where all the tool binaries are. From here, you can run this to launch an interactive CLI to chat:
./llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
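Since our local model directory is already mounted at /models, you can also chat with the Ampere-quantized file we downloaded earlier instead of pulling a new one; a minimal sketch using flags covered earlier in this post:

```bash
# Interactive chat with the locally mounted, Ampere-quantized model
./llama-cli -m /models/Qwen3-4B-Q8R16.gguf -i --color -c 2048
```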
switching models

Unlike Ollama, which can hot swap models from its local storage, the llama-server binary loads a single model file when it starts and has no built-in command to swap models dynamically while running. To change the model being served by llama-server, you must restart the server with the new model path specified in the -m flag.
If your goal is to have multiple models accessible at the same time, you have two primary options:
- Run a separate server for each model: Start a new llama-server container for each model and use a different port for each one to avoid conflicts (see the sketch after this list).
- Use a model management proxy: Tools like llama-swap or FlexLLama can act as a proxy. You configure them with a list of your models and their paths. These proxies listen on a single port and automatically start and stop llama-server instances based on the model selected in your API request (i.e., hot swapping). Once the right server is up, they proxy your HTTP requests to it.
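A sketch of the first option, reusing the Ampere container from above; the second model file name is illustrative, and each container gets its own name and host port:

```bash
# First model on host port 8081
docker run -d --name llama-qwen \
  -v /home/opc/llama.cpp/models/:/models -p 8081:8081 \
  amperecomputingai/llama.cpp:latest \
  -m /models/Qwen3-4B-Q8R16.gguf --host 0.0.0.0 --port 8081 --ctx-size 4096

# Second model on host port 8082
docker run -d --name llama-second \
  -v /home/opc/llama.cpp/models/:/models -p 8082:8082 \
  amperecomputingai/llama.cpp:latest \
  -m /models/another-model.gguf --host 0.0.0.0 --port 8082 --ctx-size 4096
```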