In my previous blog posts, we compared several options for running LLMs locally in an Oracle Cloud Infrastructure (OCI) Ampere-based Linux VM. The Ampere A1 Flex instances on OCI are ARM64-based, CPU-only shapes powered by Ampere Altra processors. They’re excellent for parallel CPU inference and deliver efficient per-core performance without GPUs. They’re also cost-effective at ~$0.01/OCPU-hour.
We then moved on to see Ollama (both the Ampere optimized container and original), Docker Model Runner (DMR) and llama.cpp (both the Ampere optimized container and original) in action on the free-tier Ampere A1 Flex with 4 cores and 24GB RAM. All of them used GGUF-quantized models for efficiency. Ollama is a user-friendly wrapper around llama.cpp, while DMR is a Docker-native tool (introduced in 2025) that also leverages llama.cpp for containerized runs. This makes Ollama and DMR very similar in resource use (i.e., the overhead from Ollama’s API server and DMR’s containerization), with llama.cpp offering raw performance but requiring more manual tuning.
In this post, let’s benchmark their performance and resource usage. We will look at how fast they can serve a model and return a response to a client. More importantly, we will compare an F16 model file with its 4-bit quantized version to check the gain in performance. Finally, we will see how much system resource each inference engine consumes to do its job.
Let’s go!
Methodology
1. Standardized environment
To get meaningful results, we must eliminate external variables with the following steps:
- Model: Use the exact same model file (qwen2.5:3b) with the exact same quantization level (Q4_K_M) for all servers. For Ollama, the default quant for this model is Q4_K_M; for all others, the quant is explicitly specified in the run. In the last 3 runs, we will run the Ampere llama.cpp container with its own optimized quants to verify its performance claims.
- Context: Use the same host machine, an Ampere A1 Flex shape with Oracle Linux (same CPU, RAM, OS), for all tests.
- Prompt/Output: Use a fixed test case:
  - Prompt text: Use the same text for every run.
- Warming up: Run the inference once on each server before recording metrics. This forces the model to load into memory/cache and ensures we’re measuring a steady state.
2. LLM inference engines
| Engine | Configuration & Focus |
|---|---|
| llama.cpp (direct) | Run the bare-metal `llama-server` executable with explicit flags (`--threads 4`, `--n-gpu-layers 0` for CPU-only, `--ctx-size 2048`, etc.). This gives us maximum control and minimum overhead. |
| Ollama | Use `ollama run <model>`. The simplest approach, focused on ease of use. We are measuring llama.cpp + the Ollama service overhead. |
| Docker Model Runner | Use `docker model run <model>`. We are measuring llama.cpp + containerization overhead. |
| Ampere optimized Ollama | Run the Ampere optimized Ollama Docker container. We are measuring llama.cpp + the Ollama service + containerization overheads. |
| Ampere optimized llama.cpp | Run the Ampere optimized llama.cpp Docker container. We are measuring llama.cpp + containerization overhead. |
Benchmark inference
Performance Metrics
LLM performance is measured in two distinct phases, and both must be tracked to gauge user experience (UX) and system throughput.
| Metric | Phase | Description | Key Factors | UX Impact |
|---|---|---|---|---|
| Time To First Token (TTFT) | Prefill (prompt evaluation) | Duration from when a user sends a prompt to when they receive the first response token. | Prompt length (reading, tokenizing, and computing the Key-Value (KV) cache for the entire input prompt), system load/queueing, network latency | Perceived responsiveness (Does the model feel fast to start?) |
| Tokens Per Second (TPS) | Decoding (output generation) | The rate at which the model streams subsequent tokens (excluding TTFT). This is the sustained throughput. | Memory bandwidth, KV cache efficiency, model size. | Smoothness/flow of the stream (Is the output consistent?) |
| Total Latency | Complete request-response cycle | TTFT + time to generate the remaining tokens. | Combination of all factors. | Overall wait time. |
Workflow
We will use a Python OpenAI client to send the same prompt to each LLM inference server (each denoted by its own base_url) multiple times, ignoring the first “warm-up” run. We then average the results for a fair comparison. The script prints the results for both latency and throughput.
Prerequisites: Make sure you have Python and the `openai` library installed with `pip install openai`.

Open the Python script and change the MODEL_NAME variable to the exact model you are using (e.g., llama3, mistral, etc.) for each engine. Then run the script from your terminal with `python benchmark_llm.py`.
import openai
import time
import os
# --- Configuration ---
# Set the specific model names for each service
MODEL_NAME = "qwen2.5:3b"
# Default OpenAI-compatible endpoints
BASE_URL = "http://localhost:8085/v1"
# Benchmark settings
PROMPT = "Explain the significance of Meiji Restoration in about 150 words."
NUM_RUNS = 5 # Total runs per service; the first is a warm-up and excluded from averages
# --- Clients ---
llm_client = openai.OpenAI(
base_url=BASE_URL,
api_key="not needed", # required but not used
)
def benchmark_endpoint(client: openai.OpenAI, model_name: str, service_name: str):
"""Runs a benchmark against a single endpoint and returns the metrics."""
print(f"\n--- Benchmarking {service_name} (Model: {model_name}) ---")
results = []
for i in range(NUM_RUNS):
print(f" Run {i+1}/{NUM_RUNS}...", end="", flush=True)
start_time = time.time()
first_token_time = None
token_count = 0
try:
stream = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": PROMPT}],
stream=True,
)
for chunk in stream:
if not getattr(chunk, "choices", None):
continue
delta = getattr(chunk.choices[0], "delta", None)
if not delta or not getattr(delta, "content", None):
continue
if first_token_time is None:
first_token_time = time.time()
token_count += 1
            end_time = time.time()
            total_time = end_time - start_time
            # Guard: if the stream produced no content tokens, fall back to end_time
            if first_token_time is None:
                first_token_time = end_time
            ttft = (first_token_time - start_time) * 1000  # in milliseconds
            generation_time = end_time - first_token_time
            # Avoid division by zero if the response is instant
            tps = token_count / generation_time if generation_time > 0 else float('inf')
# We ignore the first run as it's a "warm-up"
if i > 0:
results.append({"ttft": ttft, "tps": tps, "total_time": total_time, "tokens": token_count})
print(f" Done. TTFT: {ttft:.2f} ms, TPS: {tps:.2f} t/s")
except Exception as e:
print(f" FAILED. Could not connect to {service_name}. Is it running? Error: {e}")
return None # Exit if a connection fails
return results
def main():
"""Main function to run and print benchmark results."""
results = benchmark_endpoint(llm_client, MODEL_NAME, "LLM Server 1")
if not results:
print("\nBenchmark could not be completed due to connection errors.")
return
# Calculate averages
avg_ttft = sum(r['ttft'] for r in results) / len(results)
avg_tps = sum(r['tps'] for r in results) / len(results)
print("\n--- Benchmark Results ---")
print(f"Runs per Service (excluding warm-up): {NUM_RUNS - 1}")
print("-" * 40)
print(f"🟢 LLM Server 1 (Model: {MODEL_NAME}):")
print(f" - Average Time to First Token (TTFT): {avg_ttft:.2f} ms")
print(f" - Average Tokens Per Second (TPS): {avg_tps:.2f} t/s")
print("-" * 40)
if __name__ == "__main__":
    main()

Results
To illustrate, here’s the result we got from llama.cpp running model Qwen2.5-3B-Q4_K_M:
--- Benchmarking llama.cpp Original (Model: Qwen2.5-3B-Q4_K_M) ---
Run 1/5... Done. TTFT: 1760.62 ms, TPS: 10.73 t/s
Run 2/5... Done. TTFT: 94.08 ms, TPS: 10.64 t/s
Run 3/5... Done. TTFT: 92.15 ms, TPS: 10.29 t/s
Run 4/5... Done. TTFT: 92.35 ms, TPS: 10.56 t/s
Run 5/5... Done. TTFT: 99.50 ms, TPS: 10.79 t/s
--- Benchmark Results ---
Runs per Service (excluding warm-up): 4
----------------------------------------
🟢 llama.cpp Original (Model: Qwen2.5-3B-Q4_K_M):
 - Average Time to First Token (TTFT): 94.52 ms
 - Average Tokens Per Second (TPS): 10.57 t/s
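The script above benchmarks a single endpoint. To sweep all engines in one pass, the single BASE_URL can be swapped for a mapping of service names to endpoints. A sketch, where the ports and paths are assumptions that must be matched to how each server was started:

```python
# Hypothetical endpoint map: service name -> (base_url, model name).
# Ports/paths are assumptions -- match them to your own deployment.
ENDPOINTS = {
    "llama.cpp": ("http://localhost:8085/v1", "Qwen2.5-3B-GGUF:Q4_K_M"),
    "Ollama": ("http://localhost:11434/v1", "qwen2.5:3b"),
    "Docker Model Runner": ("http://localhost:12434/engines/v1", "ai/qwen2.5:3B-Q4_K_M"),
}

def summarize(results):
    """Average the TTFT/TPS metrics recorded by benchmark_endpoint()."""
    n = len(results)
    return {
        "avg_ttft": sum(r["ttft"] for r in results) / n,
        "avg_tps": sum(r["tps"] for r in results) / n,
    }

# With the benchmark script loaded, the sweep then becomes:
# for name, (url, model) in ENDPOINTS.items():
#     client = openai.OpenAI(base_url=url, api_key="not needed")
#     runs = benchmark_endpoint(client, model, name)
#     if runs:
#         print(name, summarize(runs))
```

This keeps the measurement code identical per engine, so only the endpoint and model name vary between runs.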
After running the test on all our services, here’s a summary table consolidating all our runs:
| Engine | Model | TTFT (ms) | TPS (t/s) |
|---|---|---|---|
| **llama.cpp** | **qwen2.5:3B-f16** | **173.60** | **5.72** |
| Ollama container | qwen2.5:3b | 154.35 | 9.04 |
| Ollama original | qwen2.5:3b | 225.62 | 7.81 |
| Ampere Ollama container | Qwen2.5-3B-GGUF:Q4_K_M | 203.46 | 11.84 |
| Docker Model Runner | ai/qwen2.5:3B-Q4_K_M | 699.81 | 8.26 |
| llama.cpp | Qwen2.5-3B-GGUF:Q4_K_M | 94.52 | 10.57 |
| Ampere llama.cpp container | Qwen2.5-3B-Q4_K_M | 89.29 | 13.36 |
| Ampere llama.cpp container | Qwen2.5-3B-Q4_K_4 | 483.83 | 15.48 |
| Ampere llama.cpp container | Qwen2.5-3B-Q8R16 | 351.88 | 14.33 |
| Ampere llama.cpp container | Qwen3-8B-Q8R16 | 422.75 | 6.13 |
We first ran llama.cpp against the F16 model file, which is the closest we can get to the original model. This establishes a baseline for the subsequent quantized runs.
Compared with F16, llama.cpp with Q4_K_M nearly halves TTFT while nearly doubling TPS, showing the benefits of quantization:
- TTFT (94.52 ms): To recall, TTFT for a request is largely the time taken to run the Prompt Evaluation (pre-fill). 94.52 ms is a very efficient TTFT for a ≈27-token prompt on a small CPU model, representing a per-token pre-fill time of ≈3.5 ms/token.
- TPS (10.57 t/s): TPS is the speed of token generation (decode), which is memory-bound. 10.57 t/s represents the system’s baseline, limited by memory bandwidth.
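As a quick sanity check, these gains can be computed directly from the figures in the summary table (the ~27-token prompt length comes from the TTFT discussion above):

```python
# Figures taken from the summary table above.
f16_ttft, f16_tps = 173.60, 5.72   # llama.cpp, F16 baseline
q4_ttft, q4_tps = 94.52, 10.57     # llama.cpp, Q4_K_M
prompt_tokens = 27                 # approximate prompt length

ttft_speedup = f16_ttft / q4_ttft             # prefill gain from quantization
tps_speedup = q4_tps / f16_tps                # decode gain from quantization
prefill_per_token = q4_ttft / prompt_tokens   # ms per prompt token

print(f"TTFT speedup: {ttft_speedup:.2f}x")               # ~1.8x
print(f"TPS speedup:  {tps_speedup:.2f}x")                # ~1.8x
print(f"Prefill cost: {prefill_per_token:.1f} ms/token")  # ~3.5 ms/token
```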
Not surprisingly, the Ampere containers (both Ollama and llama.cpp) are the most performant in total throughput, with the optimized Ampere llama.cpp container running Q4_K_4 and Q8R16 leading the pack. Without hardware optimization, native llama.cpp beats both Ollama and DMR, since it avoids their API and container overheads wrapping around llama.cpp.
It’s also worth noting that both native llama.cpp and the Ampere optimized container excel in responsiveness during prompt evaluation. When we switch to the Ampere quant schemes (Q4_K_4, Q8R16), responsiveness drops significantly.
Benchmark resource usage
Even a fast inference engine with a sleek model won’t work if it monopolizes all our system resources. Benchmarking LLM servers for frugality involves carefully measuring resource consumption during a controlled workload. Since both Ollama and DMR use llama.cpp under the hood but add an abstraction layer, the comparison is essentially between the optimized llama.cpp core and the overhead introduced by the server wrapper (Ollama/Docker).
Since the Ampere A1 Flex instances lack GPUs, we would also expect CPU saturation (80-100% utilization across allocated cores) during runs.
Resource Metrics
We’ll monitor the host system’s overall memory before, during, and after running an inference task.
Memory
Memory use is dominated by model size (e.g., ~2GB for a 3B Q4 model) plus context/overhead. llama.cpp can use more if not tuned, since it defaults to a large context window (reduce it with the `-c 8192` flag for parity with Ollama).
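A rough rule of thumb: weight memory is about parameters × bits-per-weight / 8, plus KV cache and engine overhead on top. A sketch, where the ~4.5 effective bits per weight for Q4_K_M is an approximation (K-quants mix bit widths):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate RAM needed just to hold the model weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 3B model at F16 vs ~4.5 effective bits for Q4_K_M (approximation):
f16 = weight_footprint_gb(3, 16)   # ~6.0 GB for F16 weights
q4 = weight_footprint_gb(3, 4.5)   # ~1.7 GB; KV cache/overhead adds a few hundred MB
print(f"F16: {f16:.1f} GB, Q4_K_M: {q4:.1f} GB")
```

These estimates line up with the RES figures we observe later in this post.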
| Metric | Tool & Measurement | What it Indicates (Frugality) |
|---|---|---|
| Model footprint | `top` (interactive monitoring) or `free -m` (idle state, quick snapshot) | Minimum RAM required to hold the model. Measured as buff/cache used for the model (use `pmap -x <PID>` to see specific memory mappings) or the increase in RES when the server starts. |
| Resident Set Size (RES) | `top` (during inference) | Active working memory (RAM) required for the inference engine, KV cache, and context buffers. A lower RES during inference is generally more frugal. |
| Peak memory usage | `ps` (with timing) or `docker stats` for containers | The absolute maximum RAM the server uses during the entire benchmark run. Crucial for provisioning. |
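Instead of watching `top` interactively, RES and peak RSS can also be read from `/proc/<pid>/status` on Linux (the VmRSS and VmHWM fields, reported in kB). A minimal sketch:

```python
def rss_mb(status_text: str, field: str = "VmRSS") -> float:
    """Extract VmRSS (current) or VmHWM (peak) from /proc/<pid>/status text, in MiB."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1]) / 1024  # kernel reports the value in kB
    raise ValueError(f"{field} not found")

# Usage against a live server (e.g., llama-server's PID):
# with open(f"/proc/{pid}/status") as f:
#     text = f.read()
# print(rss_mb(text, "VmRSS"), rss_mb(text, "VmHWM"))
```

Sampling VmHWM once after the run is a simple way to capture the peak without polling.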
CPU
All tools saturate CPU cores during inference. Differences are better interpreted indirectly via speed/efficiency.
| Metric | Tool & Measurement | What it Indicates (Frugality) |
|---|---|---|
| %CPU | `top` (during inference) | CPU utilization. A lower percentage, while achieving a target throughput, indicates better CPU efficiency (i.e., less waste). |
| CPU time | `time` command wrapper (for direct llama.cpp runs) | Total CPU-seconds consumed to complete the job. Lower time indicates higher efficiency. |
Cross-check with container metrics
Since we are running many of our LLM inference servers as containers, a very straightforward tool to monitor their usage is `docker stats`. We can cross-check with system-wide tools like `free -h` or `cat /proc/meminfo` before/after inference to see total memory and CPU usage.
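For scripted cross-checks, `docker stats` can run in one-shot mode with a custom format. A sketch that parses its MEM USAGE column (the `--no-stream`/`--format` flags are standard Docker CLI; the sample line mirrors the output shown later in this post):

```python
def container_mem_gib(stats_line: str) -> float:
    """Parse the MEM USAGE half of a `docker stats --format '{{.Name}} {{.MemUsage}}'` line."""
    # e.g. "ollama 2.5GiB / 22.41GiB" -> 2.5
    usage = stats_line.split()[1]
    for unit, scale in (("GiB", 1.0), ("MiB", 1 / 1024), ("KiB", 1 / 1024**2)):
        if usage.endswith(unit):
            return float(usage[:-len(unit)]) * scale
    raise ValueError(f"unrecognized unit in {usage!r}")

# One-shot sample (no interactive stream):
# import subprocess
# out = subprocess.check_output(
#     ["docker", "stats", "--no-stream", "--format", "{{.Name}} {{.MemUsage}}"], text=True)
# for line in out.splitlines():
#     print(line.split()[0], f"{container_mem_gib(line):.2f} GiB")
```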
Workflow
We will test a single-user workload by running one request at a time. This measures the per-request resource cost and the base memory/CPU footprint.
Before starting inference

Open a terminal on the host machine (outside any container) and run `free -h`. This displays human-readable memory stats (total, used, free, shared, cache, etc.). In another terminal, run `top`. This opens an interactive view; press Shift + M to sort by memory usage (the %MEM column).

Establish baseline

- For `free`: note the Used and Buff/cache columns under Mem.
- For `top`: note the top processes and the overall CPU and memory stats at the top of the screen (e.g., %Mem used).

Start inference

In another terminal, run your LLM server command, e.g. `docker model run ai/qwen2.5:3B-Q4_K_M`. Once it’s running and you’ve sent a prompt for inference, quickly switch back to the two monitoring terminals.

During inference

- Run `free -h` again and compare to the baseline:
  - Look for an increase in Used memory by ~2GB (the model weights are loaded here).
  - If the spike is in Buff/cache instead, the system is using the page cache for the GGUF file without fully attributing it to a process. This is fairly common for mmap’d files in llama.cpp.
  - Repeat every few seconds if needed, or use `watch -n 1 free -h` for continuous monitoring.
- In `top`, watch for:
  - A new process (e.g., llama-server) consuming ~2GB in the RES (resident set size) column.
  - A system-wide %Mem increase of 8-10% on our 24GB system (2GB is about 8%).
  - High CPU usage on all 4 cores (e.g., showing up as ~390%) during computation, confirming active inference.

After inference

Stop the LLM server process (e.g., Ctrl+C) and run `free -h` once more. Memory should drop back toward the baseline, though some cache might persist until cleared by the system.
Results
In terms of resource usage, there is not a big difference between the tools.
Memory usage is driven mostly by the model used, NOT the tool. For example, running the F16 model file consumes ~6GB RES while the Q4 files use about 2.1GB. Moving from 16-bit float to 4-bit integer thus reduces the memory footprint by nearly 66%. As for the tools, llama.cpp has the lowest memory overhead, with DMR/Ollama adding only a slight ~50-300MB.
All tools saturated the CPU at 90-100% on each allocated OCPU during inference, with little variation across model sizes, API, or container overhead. Differences show up in efficiency (measured by tokens per core-hour) rather than in per-tool CPU variance.
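The tokens-per-core-hour efficiency mentioned above can be derived directly from the throughput numbers, assuming all 4 OCPUs of our shape are saturated:

```python
def tokens_per_core_hour(tps: float, cores: int = 4) -> float:
    """Normalize decode throughput by the number of CPU cores consumed."""
    return tps * 3600 / cores

# From the summary table: native llama.cpp vs the Ampere container (Q4_K_M).
print(tokens_per_core_hour(10.57))  # ~9513 tokens/core-hour
print(tokens_per_core_hour(13.36))  # ~12024 tokens/core-hour
```

By this measure, the Ampere container extracts roughly 26% more work from the same four cores.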
Here’s an example of top results when I ran native llama.cpp with Qwen2.5-3B-Q4_K_M:
top - 22:51:12 up 124 days, 4:03, 2 users, load average: 1.06, 0.94, 0.85
Tasks: 284 total, 2 running, 282 sleeping, 0 stopped, 0 zombie
%Cpu(s): 97.0 us, 0.4 sy, 0.0 ni, 2.1 id, 0.0 wa, 0.3 hi, 0.2 si, 0.1 st
MiB Mem : 22947.4 total, 8451.3 free, 14496.0 used, 12941.0 buff/cache
MiB Swap: 5120.0 total, 3917.8 free, 1202.2 used. 8451.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1145026 opc 20 0 3160720 2.1g 1.8g R 387.4 9.2 6:17.12 llama-server
1146350 opc 20 0 280052 65540 25784 S 0.7 0.3 0:00.73 python
Here’s an example of top results when I ran Ampere llama.cpp container with Qwen2.5-3B-Q4_K_4.
Resource use during inference:
[opc@test ~]$ top
top - 22:31:57 up 119 days, 3:44, 3 users, load average: 2.84, 1.17, 0.75
Tasks: 353 total, 2 running, 350 sleeping, 1 stopped, 0 zombie
%Cpu(s): 96.8 us, 2.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.5 hi, 0.2 si, 0.2 st
MiB Mem : 22947.4 total, 1952.1 free, 20995.3 used, 19047.6 buff/cache
MiB Swap: 5120.0 total, 1758.6 free, 3361.4 used. 1952.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
292517 root 20 0 3208056 2.2g 2.0g R 380.1 9.8 3:54.63 llama-server
3586364 root 20 0 2628136 60168 22556 S 5.3 0.3 38:22.16 dockerd
.....
From top, we can see that even after inference, model files are kept in the page cache, reducing free memory.
[opc@test ~]$ top
%Cpu(s): 0.6 us, 0.6 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.1 hi, 0.0 si, 0.1 st
MiB Mem : 22947.4 total, 594.1 free, 22353.3 used, 20935.9 buff/cache
MiB Swap: 5120.0 total, 3871.3 free, 1248.7 used. 594.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
195614 root 20 0 3729320 71560 42184 S 0.3 0.3 23:39.94 dockerd
504104 opc 20 0 234664 19516 18236 R 0.3 0.1 0:00.16 top
1 root 20 0 403708 23728 18496 S 0.0 0.1 42:08.34 systemd
And here’s Ollama during inference:
[opc@test ~]$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
3603145f89e5 ollama 382.43% 2.5GiB / 22.41GiB 11.15% 5.23GB / 22.7MB 2.33MB / 5.2GB 31
As for DMR, it loads the model into a host-level process rather than inside the container, producing a spike of around 2GB in the host’s used memory or cache when the model is loaded and inferencing.
[opc@test ~]$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
641e2223baf5 docker-model-runner 378.67% 2.232GiB / 22.41GiB 9.96% 162kB / 348kB 44.5GB / 416MB 18
[opc@test ~]$ top
top - 01:21:08 up 120 days, 6:33, 3 users, load average: 1.98, 0.91, 1.06
Tasks: 280 total, 3 running, 276 sleeping, 1 stopped, 0 zombie
%Cpu(s): 42.9 us, 3.9 sy, 0.0 ni, 52.9 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 22947.4 total, 16518.6 free, 6428.8 used, 5596.0 buff/cache
MiB Swap: 5120.0 total, 3302.6 free, 1817.4 used. 16518.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
660935 systemd+ 20 0 3086052 2.1g 1.8g R 160.3 9.3 4:23.72 com.docker.llam
650828 systemd+ 20 0 1654124 44084 12420 S 21.3 0.2 0:07.61 model-runner
195614 root 20 0 3679784 71480 41084 S 0.7 0.3 24:34.19 dockerd
207385 root 20 0 1237940 13088 9392 S 0.3 0.1 5:12.29 containerd-shim
650805 root 20 0 1237940 14936 10908 S 0.3 0.1 0:00.73 containerd-shim
650861 root 20 0 1819308 3624 3264 S 0.3 0.0 0:00.14 docker-proxy
660849 opc 20 0 1848960 36644 23676 S 0.3 0.2 0:00.38 docker
683009 root 20 0 1238196 15056 9924 S 0.3 0.1 14:17.80 containerd-shim
1825069 root 20 0 2083832 36648 21612 S 0.3 0.2 48:34.30 containerd
.....
Key takeaways
In this post, we’ve seen the CPU-only inference performance across different LLM inference servers running on an Arm-based Ampere A1 Flex instance.
In general, we see that quantizing from F16 to Q4 improves responsiveness and throughput nearly 2x.
Regarding the different inference engines, with optimizations in both the engine and the quantization schemes, the Ampere containers lead the pack in raw throughput on the Ampere platform. This makes me wonder: what if Ampere shipped its optimizations in a non-container format so we could run an optimized llama.cpp binary directly? Would it be even faster with no containerization overhead to deal with?
Ollama and DMR add minimal serving overhead, making them near-parity on this hardware. Memory use is model-dominated, with low overhead from containers/servers. No major differences in CPU % (all saturate cores) across tools, but speed can vary by optimization and batching.
In summary, the choice of tool comes down to one’s priority:
- Use Ampere’s optimized llama.cpp Docker image (amperecomputingai/llama.cpp) for raw speed; it is ideal for batch inference.
- Use Ollama for quick API serving and in-tool model switching.
- Use DMR for integration with existing containerized workflows.
What’s next
In my next post, I will take this a step further by looking at concurrency and efficiency. I’ll run multiple requests simultaneously (e.g., 4 or 8 parallel requests) against the servers to measure how well these different inference engines share resources and minimize overhead during peak load. Stay tuned!