In my last blog post, I reviewed various options for local LLM inferencing on an ARM-based Ampere A1 Compute instance with 4 cores and 24 GB RAM running Linux. I concluded that the Ampere optimized Ollama strikes a good balance of performance and ease of use for our scenario.
In this post, I will discuss the tooling: CLI tools, cURL for API interactions, and programmatic access via Python bindings. Specifically, we’ll run LLMs locally with Ollama (both the Ampere optimized version and the original) and Docker Model Runner. We’ll look at llama.cpp in the next post.
Docker Model Runner
One reason I decided to write this post is that most tutorials as of this writing (early September 2025) focus on running Docker Model Runner (DMR) in Docker Desktop. I’d like to document my scenario, which is entirely CLI-based and programmatic, in a Linux environment with Docker Engine as the backend.
Install
Update system and install the Docker Model Runner plugin:
sudo dnf update
sudo dnf install docker-model-plugin
For Ubuntu, just replace dnf with apt-get.
We can verify the installation with docker model version
The docker model command is now available alongside regular Docker commands like docker run, docker ps, etc.
Run an AI Model
Now let’s run an AI model from the Hub, which serves as a curated catalog of the most popular models. We just need a single command:
docker model run ai/qwen2.5:3B-Q4_K_M
Note: you can also pull and run any GGUF model file from Hugging Face.
This command automatically performs several actions:
- Downloads the qwen2.5:3B model with Q4_K_M quantization to local storage
- Starts a host-installed inference server and exposes an OpenAI-compatible API endpoint
- Uses llama.cpp as the inference engine. It runs as a native host process and loads qwen2.5:3B on demand
- Serves the model on port 12434, accessible via REST API
- Launches an interactive chat CLI. To exit the chat, type /bye.
The model will stay in memory until another model is requested.
Some useful DMR commands (that bear a resemblance to regular docker commands):
- docker model run <model> "Write a haiku about Docker": Run a model, generate a response for this prompt, then exit.
- docker model pull <model>: Pull models from Docker Hub, just like images.
- docker model ps: Show all running models, similar to how docker ps shows running containers. Now you should see ai/qwen2.5:3B-Q4_K_M listed and running.
- docker model ls: List models pulled to your local environment with more information. For example:
MODEL NAME             PARAMETERS  QUANTIZATION     ARCHITECTURE  MODEL ID      CREATED       SIZE
ai/qwen2.5:3B-Q4_K_M   3.09 B      IQ2_XXS/Q4_K_M   qwen2         41045df49cc0  5 months ago  1.79 GiB
ai/llama3.2:3B-Q4_0    3.21 B      Q4_0             llama         da80a841836d  4 months ago  1.78 GiB
- docker model rm <model>: Remove a local model.
- docker model inspect <model>: Display model metadata in JSON. For example:
$ docker model inspect ai/qwen2.5:3B-Q4_K_M
{
  "id": "sha256:41045df49cc0d72a4f8c15eb6b21464d3e6f4dc2899fe8ccd9e5b72bdf4d0bf9",
  "tags": [
    "ai/qwen2.5:3B-Q4_K_M"
  ],
  "created": 1744119140,
  "config": {
    "format": "gguf",
    "quantization": "IQ2_XXS/Q4_K_M",
    "parameters": "3.09 B",
    "architecture": "qwen2",
    "size": "1.79 GiB"
  }
}
- docker model logs: Monitor model usage and debug issues.
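Since the rest of this post leans on programmatic access, here is a minimal Python sketch (my own addition, assuming the JSON layout shown above) that shells out to docker model inspect and pulls out a few metadata fields:
import json
import subprocess

MODEL = "ai/qwen2.5:3B-Q4_K_M"  # the model we pulled earlier

# `docker model inspect` prints the JSON metadata shown above
result = subprocess.run(
    ["docker", "model", "inspect", MODEL],
    capture_output=True, text=True, check=True
)
meta = json.loads(result.stdout)

cfg = meta["config"]
print(f"{MODEL}: {cfg['parameters']} parameters, "
      f"{cfg['quantization']} quantization, {cfg['size']} on disk")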
Inference
Now let’s move forward from chatting in the CLI to accessing the inference server via API.
Base URL
Model Runner exposes two types of OpenAI-compatible endpoints:
- http://<host>:<port>/engines/<engine-name>/v1 via TCP, for host processes and external applications
- http://model-runner.docker.internal/engines/<engine-name>/v1 for other containers, using an internal DNS name
Since we’ll primarily be interacting with it from a host process such as cURL or a Python script, let’s fill in each part of http://<host>:<port>/engines/<engine-name>/v1/ in the next steps.
<host>:<port>
We can obtain the port in two ways:
- Find the container that was started by docker model run with docker ps. Note the <container-id> and the PORTS column, e.g. 0.0.0.0:12434->8000/tcp. Here, 12434 is the port exposed on the host.
- Check docker logs <container-id>. In the first few lines, Model Runner prints something like Listening on http://0.0.0.0:12434. That's the base URL; replace 0.0.0.0 with localhost if you're calling from the same machine.
The endpoint is thus http://localhost:12434/

<engine-name>
Model Runner always scopes endpoints by <engine-name>. Currently, it only supports llama.cpp, which covers most GGUF-format models. There is talk of future support for other engines such as vllm for Hugging Face Transformers/PyTorch models.
If you're curious, you can check docker logs <container-id> for a line like Starting engine: llama.cpp or msg="Loading llama.cpp backend runner", which confirms that the engine name is llama.cpp.
Now we have the complete base URL: http://localhost:12434/engines/llama.cpp/v1/, with the /v1/... part following the same schema as OpenAI’s API.
cURL
Now that we have figured out how to connect, let’s do a simple curl to list all models.
curl http://localhost:12434/engines/llama.cpp/v1/models
You would get JSON back with the model(s) you’ve loaded:
{
  "object": "list",
  "data": [
    {
      "id": "ai/qwen2.5:3B-Q4_K_M",
      "object": "model",
      "created": 1742816981,
      "owned_by": "docker"
    },
    {
      "id": "ai/llama3.2:3B-Q4_0",
      "object": "model",
      "created": 1745777589,
      "owned_by": "docker"
    }
  ]
}
We can then generate some text!
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "ai/qwen2.5:3B-Q4_K_M",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Write a haiku about Docker." }
]
}'
The /chat/completions endpoint follows the same schema as OpenAI’s API, so you can reuse existing client code. You can also hit /completions for plain text completion or /embeddings for vector embeddings.
Python
First, install the OpenAI Python client: pip install openai.
The api_key in the following script can be any string as Model Runner doesn’t validate it.
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)
resp = client.chat.completions.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Docker."}
    ]
)
print(resp.choices[0].message.content)
Save this as app.py and run python app.py. It sends our prompt to Docker Model Runner, which loads the specified model if needed, and prints the completion generated by the model.
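The OpenAI client also supports streaming, which feels much more responsive on a modest 4-core instance. A minimal sketch, assuming the same endpoint and model as above:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)

# stream=True yields chunks as llama.cpp generates tokens
stream = client.chat.completions.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()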
Package GGUF model file
docker model package allows us to package a GGUF file as a Docker model OCI artifact for the following advantages:
Standardization: It provides a standardized way to distribute and manage AI models, similar to how container images are handled. This makes it easier to integrate models into existing workflows and CI/CD pipelines.
Portability: The packaged model can be pulled and run on any machine with Docker Model Runner, ensuring a consistent and reproducible environment.
OCI Artifacts: Models are packaged as OCI (Open Container Initiative) artifacts, an open standard supported by various registries and tools.
Simplicity: It simplifies the process of serving and interacting with models by providing a consistent CLI and an OpenAI-compatible API.
In the following command, we package an existing GGUF file and create a new, self-contained Docker model artifact:
docker model package --gguf <path_to_gguf_file> <model_name>:<tag>
- --gguf <path_to_gguf_file>: This is a required flag. Provide the absolute path to your GGUF file. If the model is sharded (e.g., model-00001-of-00003.gguf), point to the first shard; the command will automatically find the other shards in the same directory.
- <model_name>:<tag>: The name and tag for your new Docker model OCI artifact (e.g., my-custom-model:v1).
You can add other options like:
- --license <path_to_license_file>: include a license file in the package.
- --push <registry_name>/<model_name>:<tag>: push the artifact to a container registry like Docker Hub.
This will package and then immediately push the OCI artifact to the specified registry.
Ampere optimized Ollama
Ampere has tweaked the original Ollama engine for better performance. We’ll see whether that claim holds up in a later benchmarking post.
First, let’s get it running.
We will use the qwen2.5:3b model with the same quantization as the one from Docker Hub.
We will run it as a container:
docker run --privileged=true --name ollama -p 11434:11434 amperecomputingai/ollama:latest
In a separate shell, run docker exec -it ollama bash. Once inside the shell, run ollama run qwen2.5:3b to pull and run the model. The model is stored in ~/.ollama/models.
Similar to Docker Model Runner, Ollama exposes a simple REST API (at http://localhost:11434) that’s OpenAI-compatible. You can access it directly with Python libraries like openai or requests, or use Ollama’s own Python binding for programmatic access.
We can reuse the above code and just change the base URL to base_url="http://localhost:11434/v1".
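For example, a minimal version of the earlier script pointed at the Ampere optimized Ollama container (using the qwen2.5:3b tag we pulled above) looks like this:
from openai import OpenAI

# Same OpenAI client, now pointed at Ollama's OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama doesn't validate the key either
)

resp = client.chat.completions.create(
    model="qwen2.5:3b",  # the tag we pulled with `ollama run qwen2.5:3b`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Docker."}
    ]
)
print(resp.choices[0].message.content)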
In addition to pulling from Ollama’s curated model library, you can also run any GGUF model files from Hugging Face. By default, a model file with Q4_K_M quantization scheme will be used. If you are using a chat (instruction) model, a chat template based on the built-in tokenizer.chat_template metadata stored inside the GGUF file will be used.
Original Ollama
For comparison, we can also run the original Ollama engine as a container
docker run -d --privileged=true -v ollama:/root/.ollama -p 11400:11434 --name ollama2 ollama/ollama
docker exec -it ollama2 ollama run qwen2.5:3b
To run it as a standalone app for arm64:
- Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
- Run a model server: Use ollama serve (background) or ollama run <model> for interactive testing.
- Serve a web app: Run Ollama as a daemon (systemd service) for always-on inference.
- Customization: Use the OLLAMA_NUM_PARALLEL environment variable to limit concurrent requests (e.g., 1-2 for A1 Flex's 4 cores).
Build a Python chatbot
Now that we know how to connect to both Ampere optimized Ollama and Docker Model Runner, we can build some interactive functionality into our previous Python script so we can pick a model to chat with. We will take advantage of the OpenAI API compatibility of both services to streamline and reuse code.
When you run the following script, it will:
- Query BOTH Docker Model Runner and Ampere optimized Ollama for their available models using OpenAI's client.models.list(). This calls each backend's GET /v1/models endpoint, which returns the list of loaded models in OpenAI-style JSON.
- Display each model ID with a number and the backend it comes from.
- Let you choose one, and use that model for the chat request.
- Prompt you for your message, which is inserted into the "role": "user" message.
- Send the request and print the reply.
- Let you pick a different model without restarting.
from openai import OpenAI

# --- Config ---
# Adjust MODEL_RUNNER_BASE depending on where you run it:
# From another container: use `host.docker.internal`
# On host: use `localhost`
MODEL_RUNNER_BASE = "http://localhost:12434/engines/llama.cpp/v1"
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1"
API_KEY = "not-needed"

# 1. Create OpenAI clients for both backends
mr_client = OpenAI(base_url=MODEL_RUNNER_BASE, api_key=API_KEY)
ollama_client = OpenAI(base_url=OLLAMA_OPENAI_BASE, api_key=API_KEY)

def get_models(client, source_name):
    try:
        resp = client.models.list()
        return [{"id": m.id, "source": source_name, "client": client} for m in resp.data]
    except Exception as e:
        print(f"Error fetching models from {source_name}: {e}")
        return []

# 2. Fetch available models from both sources
models = get_models(mr_client, "model_runner") + get_models(ollama_client, "ollama")
if not models:
    print("No models found. Make sure Model Runner and/or Ollama are running in OpenAI mode.")
    exit(1)

while True:
    # 3. Show combined list and let user choose
    print("\nAvailable models:")
    for idx, m in enumerate(models, start=1):
        print(f"{idx}. {m['id']} ({m['source']})")
    choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
    if choice.lower() == 'q':
        break
    try:
        model_idx = int(choice) - 1
        if model_idx < 0 or model_idx >= len(models):
            raise ValueError
    except ValueError:
        print("Invalid selection.")
        continue
    selected = models[model_idx]
    print(f"Using model: {selected['id']} from {selected['source']}")
    while True:
        # 4. Ask for user prompt
        user_prompt = input("Enter your prompt (or 'back' to choose another model): ").strip()
        if user_prompt.lower() == 'back':
            break
        if not user_prompt:
            continue
        try:
            # 5. Send the request
            resp = selected["client"].chat.completions.create(
                model=selected["id"],
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": user_prompt}
                ]
            )
            # 6. Print the model's reply
            print("\nModel reply:\n")
            print(resp.choices[0].message.content)
        except Exception as e:
            print(f"Error querying {selected['source']}: {e}")
We can run it with python chat.py. Now you can:
- Pick model 1, send multiple prompts
- Type back to return to model selection
- Pick another model without restarting the container
My result looks like this
Available models:
1. ai/qwen2.5:3B-Q4_K_M (model_runner)
2. ai/llama3.2:3B-Q4_0 (model_runner)
3. qwen2.5:3b (ollama)
4. hf.co/AmpereComputing/llama-3.2-3b-instruct-gguf:Llama-3.2-3B-Instruct-Q8R16.gguf (ollama)
Select a model [1-4] or 'q' to quit: 1
Using model: ai/qwen2.5:3B-Q4_K_M from model_runner
Enter your prompt (or 'back' to choose another model): write a haiku for local LLM
Model reply:
Local LLM whispers,
Infinite knowledge flows through,
Wordsmith of words.
Enter your prompt (or 'back' to choose another model):
Improve multi-model serving in Model Runner
Even though client.models.list() will show you multiple models, Docker Model Runner doesn’t truly run them all at once. It loads one into memory at a time, and if you call a different one, it needs to perform these tasks sequentially:
- Unload the current model from the llama.cpp backend
- Load the new one from disk
- Initialize it in memory
Depending on model size and hardware, this can take a long time or appear to hang.
From Docker’s own issue tracker, true concurrent multi‑model serving isn’t fully supported yet. Right now, switching models mid‑session can cause long cold‑start delays or even timeouts if the backend doesn’t handle the swap cleanly.
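You can see the swap cost for yourself by timing the first request to each model. A rough sketch, assuming the two models we pulled earlier and the Model Runner endpoint from above:
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)

def time_first_reply(model_id):
    # The first call after a swap includes unload + load + init;
    # subsequent calls to the same model are much faster.
    start = time.perf_counter()
    client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Say hi."}],
        max_tokens=5,
    )
    return time.perf_counter() - start

for model_id in ["ai/qwen2.5:3B-Q4_K_M", "ai/llama3.2:3B-Q4_0"]:
    print(f"{model_id}: first reply in {time_first_reply(model_id):.1f}s")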
To avoid the delay, we have a few options:
- Restart Model Runner with the desired model: If you only need one model at a time, stop the container and start it again with the new model before running your client.
- Run each model in its own Model Runner container (multi-container mapping): In this approach, we don't swap models inside a single Model Runner container. Instead, we run each model in its own dedicated Model Runner container on a different port, and have the Python client map each model name to its own base_url.
For example, run these on different ports:
docker run -d \
  --name llama3.2 \
  -p 12434:12434 \
  docker/model-runner:latest ai/llama3.2:3B-Q4_0
docker run -d \
  --name qwen3b \
  -p 12435:12434 \
  docker/model-runner:latest ai/qwen2.5:3B-Q4_K_M
Now each model is already loaded in its own container. In the Python client below, we hard-code a mapping of model IDs to their container base URLs. When we pick a model, the client automatically points to the correct container/port. You can switch instantly between models without restarting anything or incurring the cold-start delay. That'll be the smoothest experience until Docker ships true multi-model support.
from openai import OpenAI

API_KEY = "not-needed"  # Model Runner doesn't validate the key

# Map model IDs to their dedicated container base URLs
MODEL_ENDPOINTS = {
    "ai/llama3.2:3B-Q4_0": "http://localhost:12434/engines/llama.cpp/v1",
    "ai/qwen2.5:3B-Q4_K_M": "http://localhost:12435/engines/llama.cpp/v1"
}
models = list(MODEL_ENDPOINTS.keys())

while True:
    # Show models and let user choose
    print("\nAvailable models:")
    for idx, model_id in enumerate(models, start=1):
        print(f"{idx}. {model_id}")
    choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
    if choice.lower() == 'q':
        break
    try:
        model_idx = int(choice) - 1
        if model_idx < 0 or model_idx >= len(models):
            raise ValueError
    except ValueError:
        print("Invalid selection.")
        continue
    selected_model = models[model_idx]
    base_url = MODEL_ENDPOINTS[selected_model]
    client = OpenAI(base_url=base_url, api_key=API_KEY)
    print(f"Using model: {selected_model}")
Monitor resource usage
When we run our Python client in one terminal, we can monitor CPU/RAM usage from another terminal using top or htop.
If we are using Docker Model Runner to manage models, or the Ampere optimized Ollama container, we can also watch their usage with docker stats in another terminal, or filter by container name, e.g. docker stats qwen3b.
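If you want to capture a snapshot from a script instead (say, alongside the timing sketch above), here is a minimal Python sketch that shells out to docker stats --no-stream:
import subprocess

# One-shot snapshot of CPU and memory usage for all running containers
out = subprocess.run(
    ["docker", "stats", "--no-stream", "--format",
     "{{.Name}}: CPU {{.CPUPerc}}, MEM {{.MemUsage}}"],
    capture_output=True, text=True, check=True
)
print(out.stdout)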