In my previous blog posts, we compared several options for running LLMs locally in an Oracle Cloud Infrastructure (OCI) Ampere-based Linux VM. The Ampere A1 Flex instances on OCI are ARM64-based, CPU-only shapes powered by Ampere Altra processors. They’re excellent for parallel CPU inference and deliver efficient per-core performance without GPUs. They’re also cost-effective at ~$0.01/OCPU-hour.
We then moved on to see Ollama (both the Ampere optimized container and original), Docker Model Runner (DMR) and llama.cpp (both the Ampere optimized container and original) in action on the free-tier Ampere A1 Flex with 4 cores and 24GB RAM. All of them used GGUF-quantized models for efficiency. Ollama is a user-friendly wrapper around llama.cpp, while DMR is a Docker-native tool (introduced in 2025) that also leverages llama.cpp for containerized runs. This makes Ollama and DMR very similar in resource use (i.e., the overhead from Ollama’s API server and DMR’s containerization), with llama.cpp offering raw performance but requiring more manual tuning.
In this post, let’s benchmark their performance and resource usage. We will look at how fast they can serve a model and return a response to a client. More importantly, we will compare an F16 model file with its 4-bit quantized version to check the gain in performance. Finally, we will see how much system resource each inference engine consumes to do its job.
Let’s go!
Methodology
1. Standardized environment
To get meaningful results, we must eliminate external variables with the following steps:
- Model: Use the exact same model file (qwen2.5:3b) with the exact same quantization level (Q4_K_M) for all servers. For Ollama, the default quant for this model is Q4_K_M; for all others, the quant is explicitly specified in the run. In the last 3 runs, we will run the Ampere llama.cpp container with its own optimized quants to verify its performance claims.
- Context: Use the same host machine, an Ampere A1 Flex shape with Oracle Linux (same CPU, RAM, OS), for all tests.
- Prompt/Output: Use a fixed test case:
  - Prompt text: Use the same text for every run.
- Warming up: Run the inference once on each server before recording metrics. This forces the model to load into memory/cache and ensures we’re measuring a steady state.
2. LLM inference engines
| Engine | Configuration & Focus |
|---|---|
| llama.cpp (direct) | Run the bare-metal `llama-server` executable with explicit flags (`--threads 4`, `--n-gpu-layers 0` for CPU-only, `--ctx-size 2048`, etc.). This gives us maximum control and minimum overhead. |
| Ollama | Use `ollama run <model>`. The simplest approach, focused on ease of use. We are measuring llama.cpp + the Ollama service overhead. |
| Docker Model Runner | Use `docker model run <model>`. We are measuring llama.cpp + containerization overhead. |
| Ampere optimized Ollama | Run the Ampere optimized Ollama Docker container. We are measuring llama.cpp + the Ollama service + containerization overheads. |
| Ampere optimized llama.cpp | Run the Ampere optimized llama.cpp Docker container. We are measuring llama.cpp + containerization overhead. |
Benchmark inference
Performance Metrics
LLM performance is measured in two distinct phases, and both must be tracked to gauge user experience (UX) and system throughput.
| Metric | Phase | Description | Key Factors | UX Impact |
|---|---|---|---|---|
| Time To First Token (TTFT) | Prefill (prompt evaluation) | Duration from when a user sends a prompt to when they receive the first response token. | Prompt length (reading, tokenizing, and computing the Key-Value (KV) cache for the entire input prompt), system load/queueing, network latency | Perceived responsiveness (Does the model feel fast to start?) |
| Tokens Per Second (TPS) | Decoding (output generation) | The rate at which the model streams subsequent tokens (excluding TTFT). This is the sustained throughput. | Memory bandwidth, KV cache efficiency, model size. | Smoothness/flow of the stream (Is the output consistent?) |
| Total Latency | Complete request-response cycle | TTFT + time to generate the remaining tokens. | Combination of all factors. | Overall wait time. |
Workflow
We will use a Python OpenAI client to send the same prompt to each LLM inference server (each denoted by its own base_url) multiple times, ignoring the first “warm-up” run. We then average the results for a fair comparison. The script prints the results for both latency and throughput.
Prerequisites: Make sure you have Python and the `openai` library installed with `pip install openai`.

Open the Python script and change the MODEL_NAME variable to the exact model you are using (e.g., llama3, mistral, etc.) for each engine. Then run the script from your terminal with `python benchmark_llm.py`.
import openai
import time
import os
# --- Configuration ---
# Set the specific model names for each service
MODEL_NAME = "qwen2.5:3b"
# Default OpenAI-compatible endpoints
BASE_URL = "http://localhost:8085/v1"
# Benchmark settings
PROMPT = "Explain the significance of Meiji Restoration in about 150 words."
NUM_RUNS = 5 # Total runs per service; the first is a warm-up and excluded from averages
# --- Clients ---
llm_client = openai.OpenAI(
base_url=BASE_URL,
api_key="not needed", # required but not used
)
def benchmark_endpoint(client: openai.OpenAI, model_name: str, service_name: str):
"""Runs a benchmark against a single endpoint and returns the metrics."""
print(f"\n--- Benchmarking {service_name} (Model: {model_name}) ---")
results = []
for i in range(NUM_RUNS):
print(f" Run {i+1}/{NUM_RUNS}...", end="", flush=True)
start_time = time.time()
first_token_time = None
token_count = 0
try:
stream = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": PROMPT}],
stream=True,
)
for chunk in stream:
if not getattr(chunk, "choices", None):
continue
delta = getattr(chunk.choices[0], "delta", None)
if not delta or not getattr(delta, "content", None):
continue
if first_token_time is None:
first_token_time = time.time()
token_count += 1
            end_time = time.time()
            total_time = end_time - start_time
            # Guard: if the stream produced no content tokens, fall back to end_time
            if first_token_time is None:
                first_token_time = end_time
            ttft = (first_token_time - start_time) * 1000  # in milliseconds
            generation_time = end_time - first_token_time
            # Avoid division by zero if the response is instant
            tps = token_count / generation_time if generation_time > 0 else float('inf')
# We ignore the first run as it's a "warm-up"
if i > 0:
results.append({"ttft": ttft, "tps": tps, "total_time": total_time, "tokens": token_count})
print(f" Done. TTFT: {ttft:.2f} ms, TPS: {tps:.2f} t/s")
except Exception as e:
print(f" FAILED. Could not connect to {service_name}. Is it running? Error: {e}")
return None # Exit if a connection fails
return results
def main():
"""Main function to run and print benchmark results."""
results = benchmark_endpoint(llm_client, MODEL_NAME, "LLM Server 1")
if not results:
print("\nBenchmark could not be completed due to connection errors.")
return
# Calculate averages
avg_ttft = sum(r['ttft'] for r in results) / len(results)
avg_tps = sum(r['tps'] for r in results) / len(results)
print("\n--- Benchmark Results ---")
print(f"Runs per Service (excluding warm-up): {NUM_RUNS - 1}")
print("-" * 40)
print(f"🟢 LLM Server 1 (Model: {MODEL_NAME}):")
print(f" - Average Time to First Token (TTFT): {avg_ttft:.2f} ms")
print(f" - Average Tokens Per Second (TPS): {avg_tps:.2f} t/s")
print("-" * 40)
if __name__ == "__main__":
    main()

Results
To illustrate, here’s the result we got from llama.cpp running model Qwen2.5-3B-Q4_K_M:
--- Benchmarking llama.cpp Original (Model: Qwen2.5-3B-Q4_K_M) ---
Run 1/5... Done. TTFT: 1760.62 ms, TPS: 10.73 t/s
Run 2/5... Done. TTFT: 94.08 ms, TPS: 10.64 t/s
Run 3/5... Done. TTFT: 92.15 ms, TPS: 10.29 t/s
Run 4/5... Done. TTFT: 92.35 ms, TPS: 10.56 t/s
Run 5/5... Done. TTFT: 99.50 ms, TPS: 10.79 t/s
--- Benchmark Results ---
Runs per Service (excluding warm-up): 4
----------------------------------------
🟢 llama.cpp Original (Model: Qwen2.5-3B-Q4_K_M):
 - Average Time to First Token (TTFT): 94.52 ms
 - Average Tokens Per Second (TPS): 10.57 t/s
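The script above benchmarks a single endpoint. To sweep all engines in one pass, the single BASE_URL can be swapped for a mapping of service names to endpoints. A sketch, where the ports and paths are assumptions that must be matched to how each server was started:

```python
# Hypothetical endpoint map: service name -> (base_url, model name).
# Ports/paths are assumptions -- match them to your own deployment.
ENDPOINTS = {
    "llama.cpp": ("http://localhost:8085/v1", "Qwen2.5-3B-GGUF:Q4_K_M"),
    "Ollama": ("http://localhost:11434/v1", "qwen2.5:3b"),
    "Docker Model Runner": ("http://localhost:12434/engines/v1", "ai/qwen2.5:3B-Q4_K_M"),
}

def summarize(results):
    """Average the TTFT/TPS metrics recorded by benchmark_endpoint()."""
    n = len(results)
    return {
        "avg_ttft": sum(r["ttft"] for r in results) / n,
        "avg_tps": sum(r["tps"] for r in results) / n,
    }

# With the benchmark script loaded, the sweep then becomes:
# for name, (url, model) in ENDPOINTS.items():
#     client = openai.OpenAI(base_url=url, api_key="not needed")
#     runs = benchmark_endpoint(client, model, name)
#     if runs:
#         print(name, summarize(runs))
```

This keeps the measurement code identical per engine, so only the endpoint and model name vary between runs.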
After running the test on all our services, here’s a summary table consolidating all our runs:
| Engine | Model | TTFT (ms) | TPS (t/s) |
|---|---|---|---|
| **llama.cpp** | **qwen2.5:3B-f16** | **173.60** | **5.72** |
| Ollama container | qwen2.5:3b | 154.35 | 9.04 |
| Ollama original | qwen2.5:3b | 225.62 | 7.81 |
| Ampere Ollama container | Qwen2.5-3B-GGUF:Q4_K_M | 203.46 | 11.84 |
| Docker Model Runner | ai/qwen2.5:3B-Q4_K_M | 699.81 | 8.26 |
| llama.cpp | Qwen2.5-3B-GGUF:Q4_K_M | 94.52 | 10.57 |
| Ampere llama.cpp container | Qwen2.5-3B-Q4_K_M | 89.29 | 13.36 |
| Ampere llama.cpp container | Qwen2.5-3B-Q4_K_4 | 483.83 | 15.48 |
| Ampere llama.cpp container | Qwen2.5-3B-Q8R16 | 351.88 | 14.33 |
| Ampere llama.cpp container | Qwen3-8B-Q8R16 | 422.75 | 6.13 |
We first ran llama.cpp against the F16 model file, which is the closest we can get to the original model. This establishes a baseline for the subsequent quantized runs.
Compared with F16, llama.cpp with Q4_K_M nearly halves TTFT while nearly doubling TPS, showing the benefits of quantization:
- TTFT (94.52 ms): To recall, TTFT for a request is largely the time taken to run the Prompt Evaluation (pre-fill). 94.52 ms is a very efficient TTFT for a ≈27-token prompt on a small CPU model, representing a per-token pre-fill time of ≈3.5 ms/token.
- TPS (10.57 t/s): TPS is the speed of token generation (decode), which is memory-bound. 10.57 t/s represents the system’s baseline, limited by memory bandwidth.
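As a quick sanity check, these gains can be computed directly from the figures in the summary table (the ~27-token prompt length comes from the TTFT discussion above):

```python
# Figures taken from the summary table above.
f16_ttft, f16_tps = 173.60, 5.72   # llama.cpp, F16 baseline
q4_ttft, q4_tps = 94.52, 10.57     # llama.cpp, Q4_K_M
prompt_tokens = 27                 # approximate prompt length

ttft_speedup = f16_ttft / q4_ttft             # prefill gain from quantization
tps_speedup = q4_tps / f16_tps                # decode gain from quantization
prefill_per_token = q4_ttft / prompt_tokens   # ms per prompt token

print(f"TTFT speedup: {ttft_speedup:.2f}x")               # ~1.8x
print(f"TPS speedup:  {tps_speedup:.2f}x")                # ~1.8x
print(f"Prefill cost: {prefill_per_token:.1f} ms/token")  # ~3.5 ms/token
```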
Not surprisingly, the Ampere containers (both Ollama and llama.cpp) are the most performant in total throughput, with the optimized Ampere llama.cpp container running Q4_K_4 and Q8R16 leading the pack. Without hardware optimization, native llama.cpp beats both Ollama and DMR, since it avoids their API and container overheads wrapping around llama.cpp.
It’s also worth noting that both native llama.cpp and the Ampere optimized container excel in responsiveness during prompt evaluation. When we switch to the Ampere quant schemes (Q4_K_4, Q8R16), responsiveness drops significantly.
Benchmark resource usage
Even a fast inference engine with a sleek model won’t work if it monopolizes all our system resources. Benchmarking LLM servers for frugality involves carefully measuring resource consumption during a controlled workload. Since both Ollama and DMR use llama.cpp under the hood but add an abstraction layer, the comparison is essentially between the optimized llama.cpp core and the overhead introduced by the server wrapper (Ollama/Docker).
Since the Ampere A1 Flex instances lack GPUs, we would also expect CPU saturation (80-100% utilization across allocated cores) during runs.
Resource Metrics
We’ll monitor the host system’s overall memory before, during, and after running an inference task.
Memory
Memory use is dominated by model size (e.g., ~2GB for a 3B Q4 model) plus context/overhead. llama.cpp can use more if not tuned, since it defaults to a large context window (reduce it with the `-c 8192` flag for parity with Ollama).
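A rough rule of thumb: weight memory is about parameters × bits-per-weight / 8, plus KV cache and engine overhead on top. A sketch, where the ~4.5 effective bits per weight for Q4_K_M is an approximation (K-quants mix bit widths):

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate RAM needed just to hold the model weights, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 3B model at F16 vs ~4.5 effective bits for Q4_K_M (approximation):
f16 = weight_footprint_gb(3, 16)   # ~6.0 GB for F16 weights
q4 = weight_footprint_gb(3, 4.5)   # ~1.7 GB; KV cache/overhead adds a few hundred MB
print(f"F16: {f16:.1f} GB, Q4_K_M: {q4:.1f} GB")
```

These estimates line up with the RES figures we observe later in this post.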
| Metric | Tool & Measurement | What it Indicates (Frugality) |
|---|---|---|
| Model footprint | `top` (interactive monitoring) or `free -m` (idle state, quick snapshot) | Minimum RAM required to hold the model. Measured as buff/cache used for the model (use `pmap -x <PID>` to see specific memory mappings) or the increase in RES when the server starts. |
| Resident Set Size (RES) | `top` (during inference) | Active working memory (RAM) required for the inference engine, KV cache, and context buffers. A lower RES during inference is generally more frugal. |
| Peak memory usage | `ps` (with timing) or `docker stats` for containers | The absolute maximum RAM the server uses during the entire benchmark run. Crucial for provisioning. |
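Instead of watching `top` interactively, RES and peak RSS can also be read from `/proc/<pid>/status` on Linux (the VmRSS and VmHWM fields, reported in kB). A minimal sketch:

```python
def rss_mb(status_text: str, field: str = "VmRSS") -> float:
    """Extract VmRSS (current) or VmHWM (peak) from /proc/<pid>/status text, in MiB."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            return int(line.split()[1]) / 1024  # kernel reports the value in kB
    raise ValueError(f"{field} not found")

# Usage against a live server (e.g., llama-server's PID):
# with open(f"/proc/{pid}/status") as f:
#     text = f.read()
# print(rss_mb(text, "VmRSS"), rss_mb(text, "VmHWM"))
```

Sampling VmHWM once after the run is a simple way to capture the peak without polling.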
CPU
All tools saturate CPU cores during inference. Differences are better interpreted indirectly via speed/efficiency.
| Metric | Tool & Measurement | What it Indicates (Frugality) |
|---|---|---|
| %CPU | `top` (during inference) | CPU utilization. A lower percentage, while achieving a target throughput, indicates better CPU efficiency (i.e., less waste). |
| CPU time | `time` command wrapper (for direct llama.cpp runs) | Total CPU-seconds consumed to complete the job. Lower time indicates higher efficiency. |
Cross-check with container metrics
Since we are running many of our LLM inference servers as containers, a very straightforward tool to monitor their usage is `docker stats`. We can cross-check with system-wide tools like `free -h` or `cat /proc/meminfo` before/after inference to see total memory and CPU usage.
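For scripted cross-checks, `docker stats` can run in one-shot mode with a custom format. A sketch that parses its MEM USAGE column (the `--no-stream`/`--format` flags are standard Docker CLI; the sample line mirrors the output shown later in this post):

```python
def container_mem_gib(stats_line: str) -> float:
    """Parse the MEM USAGE half of a `docker stats --format '{{.Name}} {{.MemUsage}}'` line."""
    # e.g. "ollama 2.5GiB / 22.41GiB" -> 2.5
    usage = stats_line.split()[1]
    for unit, scale in (("GiB", 1.0), ("MiB", 1 / 1024), ("KiB", 1 / 1024**2)):
        if usage.endswith(unit):
            return float(usage[:-len(unit)]) * scale
    raise ValueError(f"unrecognized unit in {usage!r}")

# One-shot sample (no interactive stream):
# import subprocess
# out = subprocess.check_output(
#     ["docker", "stats", "--no-stream", "--format", "{{.Name}} {{.MemUsage}}"], text=True)
# for line in out.splitlines():
#     print(line.split()[0], f"{container_mem_gib(line):.2f} GiB")
```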
Workflow
We will test a single-user workload by running one request at a time. This measures the per-request resource cost and the base memory/CPU footprint.
Before starting inference

Open a terminal on the host machine (outside any container) and run `free -h`. This displays human-readable memory stats (total, used, free, shared, cache, etc.). In another terminal, run `top`. This opens an interactive view; press Shift + M to sort by memory usage (the %MEM column).

Establish baseline

- For `free`: note the Used and Buff/cache columns under Mem.
- For `top`: note the top processes and the overall CPU and memory stats at the top of the screen (e.g., %Mem used).

Start inference

In another terminal, run your LLM server command, e.g. `docker model run ai/qwen2.5:3B-Q4_K_M`. Once it’s running and you’ve sent a prompt for inference, quickly switch back to the two monitoring terminals.

During inference

- Run `free -h` again and compare to the baseline:
  - Look for an increase in Used memory by ~2GB (the model weights are loaded here).
  - If the spike is in Buff/cache instead, the system is using the page cache for the GGUF file without fully attributing it to a process. This is fairly common for mmap’d files in llama.cpp.
  - Repeat every few seconds if needed, or use `watch -n 1 free -h` for continuous monitoring.
- In `top`, watch for:
  - A new process (e.g., llama-server) consuming ~2GB in the RES (resident set size) column.
  - A system-wide %Mem increase of 8-10% on our 24GB system (2GB is about 8%).
  - High CPU usage on all 4 cores (e.g., showing up as ~390%) during computation, confirming active inference.

After inference

Stop the LLM server process (e.g., Ctrl+C) and run `free -h` once more. Memory should drop back toward the baseline, though some cache might persist until cleared by the system.
Results
In terms of resource usage, there is not a big difference between the tools.
Memory usage is driven mostly by the model used, NOT the tool. For example, running the F16 model file consumes ~6GB RES while the Q4 files use about 2.1GB. Moving from 16-bit float to 4-bit integer thus reduces the memory footprint by nearly 66%. As for the tools, llama.cpp has the lowest memory overhead, with DMR/Ollama adding only a slight ~50-300MB.
All tools saturated the CPU at 90-100% on each allocated OCPU during inference, with little variation across model sizes, API, or container overhead. Differences show up in efficiency (measured by tokens per core-hour) rather than in per-tool CPU variance.
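The tokens-per-core-hour efficiency mentioned above can be derived directly from the throughput numbers, assuming all 4 OCPUs of our shape are saturated:

```python
def tokens_per_core_hour(tps: float, cores: int = 4) -> float:
    """Normalize decode throughput by the number of CPU cores consumed."""
    return tps * 3600 / cores

# From the summary table: native llama.cpp vs the Ampere container (Q4_K_M).
print(tokens_per_core_hour(10.57))  # ~9513 tokens/core-hour
print(tokens_per_core_hour(13.36))  # ~12024 tokens/core-hour
```

By this measure, the Ampere container extracts roughly 26% more work from the same four cores.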
Here’s an example of top results when I ran native llama.cpp with Qwen2.5-3B-Q4_K_M:
top - 22:51:12 up 124 days, 4:03, 2 users, load average: 1.06, 0.94, 0.85
Tasks: 284 total, 2 running, 282 sleeping, 0 stopped, 0 zombie
%Cpu(s): 97.0 us, 0.4 sy, 0.0 ni, 2.1 id, 0.0 wa, 0.3 hi, 0.2 si, 0.1 st
MiB Mem : 22947.4 total, 8451.3 free, 14496.0 used, 12941.0 buff/cache
MiB Swap: 5120.0 total, 3917.8 free, 1202.2 used. 8451.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1145026 opc 20 0 3160720 2.1g 1.8g R 387.4 9.2 6:17.12 llama-server
1146350 opc 20 0 280052 65540 25784 S 0.7 0.3 0:00.73 python
Here’s an example of top results when I ran Ampere llama.cpp container with Qwen2.5-3B-Q4_K_4.
Resource use during inference:
[opc@test ~]$ top
top - 22:31:57 up 119 days, 3:44, 3 users, load average: 2.84, 1.17, 0.75
Tasks: 353 total, 2 running, 350 sleeping, 1 stopped, 0 zombie
%Cpu(s): 96.8 us, 2.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.5 hi, 0.2 si, 0.2 st
MiB Mem : 22947.4 total, 1952.1 free, 20995.3 used, 19047.6 buff/cache
MiB Swap: 5120.0 total, 1758.6 free, 3361.4 used. 1952.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
292517 root 20 0 3208056 2.2g 2.0g R 380.1 9.8 3:54.63 llama-server
3586364 root 20 0 2628136 60168 22556 S 5.3 0.3 38:22.16 dockerd
.....
From top, we can see that even after inference, model files are kept in the page cache, reducing free memory.
[opc@test ~]$ top
%Cpu(s): 0.6 us, 0.6 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.1 hi, 0.0 si, 0.1 st
MiB Mem : 22947.4 total, 594.1 free, 22353.3 used, 20935.9 buff/cache
MiB Swap: 5120.0 total, 3871.3 free, 1248.7 used. 594.1 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
195614 root 20 0 3729320 71560 42184 S 0.3 0.3 23:39.94 dockerd
504104 opc 20 0 234664 19516 18236 R 0.3 0.1 0:00.16 top
1 root 20 0 403708 23728 18496 S 0.0 0.1 42:08.34 systemd
And here’s Ollama during inference:
[opc@test ~]$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
3603145f89e5 ollama 382.43% 2.5GiB / 22.41GiB 11.15% 5.23GB / 22.7MB 2.33MB / 5.2GB 31
As for DMR, it loads the model into a host-level process rather than inside the container, producing a spike of around 2GB in the host’s used memory or cache when the model is loaded and inferencing.
[opc@test ~]$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
641e2223baf5 docker-model-runner 378.67% 2.232GiB / 22.41GiB 9.96% 162kB / 348kB 44.5GB / 416MB 18
[opc@test ~]$ top
top - 01:21:08 up 120 days, 6:33, 3 users, load average: 1.98, 0.91, 1.06
Tasks: 280 total, 3 running, 276 sleeping, 1 stopped, 0 zombie
%Cpu(s): 42.9 us, 3.9 sy, 0.0 ni, 52.9 id, 0.0 wa, 0.2 hi, 0.1 si, 0.0 st
MiB Mem : 22947.4 total, 16518.6 free, 6428.8 used, 5596.0 buff/cache
MiB Swap: 5120.0 total, 3302.6 free, 1817.4 used. 16518.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
660935 systemd+ 20 0 3086052 2.1g 1.8g R 160.3 9.3 4:23.72 com.docker.llam
650828 systemd+ 20 0 1654124 44084 12420 S 21.3 0.2 0:07.61 model-runner
195614 root 20 0 3679784 71480 41084 S 0.7 0.3 24:34.19 dockerd
207385 root 20 0 1237940 13088 9392 S 0.3 0.1 5:12.29 containerd-shim
650805 root 20 0 1237940 14936 10908 S 0.3 0.1 0:00.73 containerd-shim
650861 root 20 0 1819308 3624 3264 S 0.3 0.0 0:00.14 docker-proxy
660849 opc 20 0 1848960 36644 23676 S 0.3 0.2 0:00.38 docker
683009 root 20 0 1238196 15056 9924 S 0.3 0.1 14:17.80 containerd-shim
1825069 root 20 0 2083832 36648 21612 S 0.3 0.2 48:34.30 containerd
.....
Key takeaways
In this post, we’ve seen the CPU-only inference performance across different LLM inference servers running on an Arm-based Ampere A1 Flex instance.
In general, we see that quantizing from F16 to Q4 improves responsiveness and throughput nearly 2x.
Regarding the different inference engines, with optimizations in both the engine and the quantization schemes, the Ampere containers lead the pack in raw throughput on the Ampere platform. This makes me wonder: what if Ampere shipped its optimizations in a non-container format so we could run an optimized llama.cpp binary directly? Would it be even faster with no containerization overhead to deal with?
Ollama and DMR add minimal serving overhead, making them near-parity on this hardware. Memory use is model-dominated, with low overhead from containers/servers. No major differences in CPU % (all saturate cores) across tools, but speed can vary by optimization and batching.
In summary, the choice of tool comes down to one’s priority:
- Use Ampere’s optimized llama.cpp Docker image (amperecomputingai/llama.cpp) for raw speed; it is ideal for batch inference.
- Use Ollama for quick API serving and in-tool model switching.
- Use DMR for integration with existing containerized workflows.
What’s next
In my next post, I will take this a step further by looking at concurrency and efficiency. I’ll run multiple requests simultaneously (e.g., 4 or 8 parallel requests) against the servers to measure how well these different inference engines share resources and minimize overhead during peak load. Stay tuned!