In my last blog post, I reviewed various options for local LLM inferencing on an ARM-based Ampere A1 Compute instance with 4 cores and 24 GB RAM running Linux. I concluded that the Ampere optimized Ollama strikes a good balance of performance and ease of use for our scenario.
In this post, I will discuss the tooling: CLI tools, cURL for API interactions, and programmatic access via Python bindings. Specifically, we’ll run LLMs locally with Ollama (both the Ampere optimized version and the original) and Docker Model Runner. We’ll look at llama.cpp in the next post.
Docker Model Runner
One reason I decided to write this post is that most tutorials as of this writing (early September 2025) focus on running Docker Model Runner (DMR) in Docker Desktop. I’d like to document my scenario, which is entirely CLI-based and programmatic, in a Linux environment with Docker Engine as the backend.
Install
Update system and install the Docker Model Runner plugin:
sudo dnf update
sudo dnf install docker-model-plugin
For Ubuntu, just replace dnf with apt-get.
We can verify the installation with docker model version
The docker model command is now available alongside regular Docker commands like docker run, docker ps, etc.
Run an AI Model
Now let’s run an AI model from the Hub, which serves as a curated catalog of the most popular models. We just need a single command:
docker model run ai/qwen2.5:3B-Q4_K_M
Note: you can also pull and run any GGUF model file from Hugging Face.
This command automatically performs several actions:
- Downloads the qwen2.5:3B model with Q4_K_M quantization to local storage
- Starts a host-installed inference server and exposes an OpenAI-compatible API endpoint
- Uses llama.cpp as the inference engine. It runs as a native host process and loads qwen2.5:3B on demand
- Serves the model on port 12434, accessible via REST API
- Launches an interactive chat CLI. To exit the chat, type /bye.
The model will stay in memory until another model is requested.
Some useful DMR commands (that bear a resemblance to regular docker commands):
- docker model run <model> "Write a haiku about Docker": Run a model, generate a response for this prompt, then exit.
- docker model pull <model>: Pull models from Docker Hub, just like images.
- docker model ps: Show all running models, similar to how docker ps shows running containers. Now you should see ai/qwen2.5:3B-Q4_K_M listed and running.
- docker model ls: List models pulled to your local environment with more information. For example:
MODEL NAME             PARAMETERS  QUANTIZATION     ARCHITECTURE  MODEL ID      CREATED       SIZE
ai/qwen2.5:3B-Q4_K_M   3.09 B      IQ2_XXS/Q4_K_M   qwen2         41045df49cc0  5 months ago  1.79 GiB
ai/llama3.2:3B-Q4_0    3.21 B      Q4_0             llama         da80a841836d  4 months ago  1.78 GiB
- docker model rm <model>: Remove a local model.
- docker model inspect <model>: Display model metadata in JSON. For example:
$ docker model inspect ai/qwen2.5:3B-Q4_K_M
{
  "id": "sha256:41045df49cc0d72a4f8c15eb6b21464d3e6f4dc2899fe8ccd9e5b72bdf4d0bf9",
  "tags": [
    "ai/qwen2.5:3B-Q4_K_M"
  ],
  "created": 1744119140,
  "config": {
    "format": "gguf",
    "quantization": "IQ2_XXS/Q4_K_M",
    "parameters": "3.09 B",
    "architecture": "qwen2",
    "size": "1.79 GiB"
  }
}
- docker model logs: Monitor model usage and debug issues.
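Since the rest of this post leans on programmatic access, here is a minimal Python sketch (my own addition, assuming the JSON layout shown above) that shells out to docker model inspect and pulls out a few metadata fields:
import json
import subprocess

MODEL = "ai/qwen2.5:3B-Q4_K_M"  # the model we pulled earlier

# `docker model inspect` prints the JSON metadata shown above
result = subprocess.run(
    ["docker", "model", "inspect", MODEL],
    capture_output=True, text=True, check=True
)
meta = json.loads(result.stdout)

cfg = meta["config"]
print(f"{MODEL}: {cfg['parameters']} parameters, "
      f"{cfg['quantization']} quantization, {cfg['size']} on disk")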
Inference
Now let’s move forward from chatting in the CLI to accessing the inference server via API.
Base URL
Model Runner exposes two types of OpenAI-compatible endpoints:
- http://<host>:<port>/engines/<engine-name>/v1 via TCP, for host processes and external applications
- http://model-runner.docker.internal/engines/<engine-name>/v1 for other containers, using an internal DNS name
Since we’ll primarily be interacting with it from a host process such as cURL or a Python script, let’s fill in each part of http://<host>:<port>/engines/<engine-name>/v1/ in the next steps.
<host>:<port>
We can obtain the port in two ways:
- Find the container that was started by docker model run with docker ps. Note the <container-id> and the PORTS column, e.g. 0.0.0.0:12434->8000/tcp. Here, 12434 is the port exposed on the host.
- Check docker logs <container-id>. In the first few lines, Model Runner prints something like Listening on http://0.0.0.0:12434. That's the base URL; replace 0.0.0.0 with localhost if you're calling from the same machine.
The endpoint is thus http://localhost:12434/

<engine-name>
Model Runner always scopes endpoints by <engine-name>. Currently, it only supports llama.cpp, which covers most GGUF-format models. There is talk of future support for other engines such as vllm for Hugging Face Transformers/PyTorch models.
If you're curious, you can check docker logs <container-id> for a line like Starting engine: llama.cpp or msg="Loading llama.cpp backend runner", which confirms that the engine name is llama.cpp.
Now we have the complete base URL: http://localhost:12434/engines/llama.cpp/v1/, with the /v1/... part following the same schema as OpenAI’s API.
cURL
Now that we have figured out how to connect, let’s do a simple curl to list all models.
curl http://localhost:12434/engines/llama.cpp/v1/models
You would get JSON back with the model(s) you’ve loaded:
{
  "object": "list",
  "data": [
    {
      "id": "ai/qwen2.5:3B-Q4_K_M",
      "object": "model",
      "created": 1742816981,
      "owned_by": "docker"
    },
    {
      "id": "ai/llama3.2:3B-Q4_0",
      "object": "model",
      "created": 1745777589,
      "owned_by": "docker"
    }
  ]
}
We can then generate some text!
curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
-H "Content-Type: application/json"\
-d '{
"model": "ai/qwen2.5:3B-Q4_K_M",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Write a haiku about Docker." }
]
}'
The /chat/completions endpoint follows the same schema as OpenAI’s API, so you can reuse existing client code. You can also hit /completions for plain text completion or /embeddings for vector embeddings.
Python
First, install the OpenAI Python client: pip install openai.
The api_key in the following script can be any string as Model Runner doesn’t validate it.
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)
resp = client.chat.completions.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Docker."}
    ]
)
print(resp.choices[0].message.content)
Save this as app.py and run python app.py. It sends our prompt to Docker Model Runner, which loads the specified model if needed, and prints the completion generated by the model.
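The OpenAI client also supports streaming, which feels much more responsive on a modest 4-core instance. A minimal sketch, assuming the same endpoint and model as above:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)

# stream=True yields chunks as llama.cpp generates tokens
stream = client.chat.completions.create(
    model="ai/qwen2.5:3B-Q4_K_M",
    messages=[{"role": "user", "content": "Write a haiku about Docker."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()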
Package GGUF model file
docker model package allows us to package a GGUF file as a Docker model OCI artifact for the following advantages:
Standardization: It provides a standardized way to distribute and manage AI models, similar to how container images are handled. This makes it easier to integrate models into existing workflows and CI/CD pipelines.
Portability: The packaged model can be pulled and run on any machine with Docker Model Runner, ensuring a consistent and reproducible environment.
OCI Artifacts: Models are packaged as OCI (Open Container Initiative) artifacts, an open standard supported by various registries and tools.
Simplicity: It simplifies the process of serving and interacting with models by providing a consistent CLI and an OpenAI-compatible API.
In the following command, we package an existing GGUF file and create a new, self-contained Docker model artifact:
docker model package --gguf <path_to_gguf_file> <model_name>:<tag>
- --gguf <path_to_gguf_file>: This is a required flag. Provide the absolute path to your GGUF file. If the model is sharded (e.g., model-00001-of-00003.gguf), point to the first shard; the command will automatically find the other shards in the same directory.
- <model_name>:<tag>: The name and tag for your new Docker model OCI artifact (e.g., my-custom-model:v1).
You can add other options like:
- --license <path_to_license_file>: include a license file in the package.
- --push <registry_name>/<model_name>:<tag>: push the artifact to a container registry like Docker Hub.
This will package and then immediately push the OCI artifact to the specified registry.
Ampere optimized Ollama
Ampere has tweaked the original Ollama engine for better performance. We’ll see whether that claim holds up in a later benchmarking post.
First, let’s get it running.
We will use the qwen2.5:3b model with the same quantization as the one from Docker Hub.
We will run it as a container:
docker run --privileged=true --name ollama -p 11434:11434 amperecomputingai/ollama:latest
In a separate shell, run docker exec -it ollama bash. Once inside the shell, run ollama run qwen2.5:3b to pull and run the model. The model is stored in ~/.ollama/models.
Similar to Docker Model Runner, Ollama exposes a simple REST API (at http://localhost:11434) that’s OpenAI-compatible. You can access it directly with Python libraries like openai or requests, or use Ollama’s own Python binding for programmatic access.
We can reuse the above code and just change the base URL to base_url="http://localhost:11434/v1".
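For example, a minimal version of the earlier script pointed at the Ampere optimized Ollama container (using the qwen2.5:3b tag we pulled above) looks like this:
from openai import OpenAI

# Same OpenAI client, now pointed at Ollama's OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama doesn't validate the key either
)

resp = client.chat.completions.create(
    model="qwen2.5:3b",  # the tag we pulled with `ollama run qwen2.5:3b`
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Docker."}
    ]
)
print(resp.choices[0].message.content)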
In addition to pulling from Ollama’s curated model library, you can also run any GGUF model files from Hugging Face. By default, a model file with Q4_K_M quantization scheme will be used. If you are using a chat (instruction) model, a chat template based on the built-in tokenizer.chat_template metadata stored inside the GGUF file will be used.
Original Ollama
For comparison, we can also run the original Ollama engine as a container
docker run -d --privileged=true -v ollama:/root/.ollama -p 11400:11434 --name ollama2 ollama/ollama
docker exec -it ollama2 ollama run qwen2.5:3b
To run it as a standalone app for arm64:
- Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
- Run a model server: Use ollama serve (background) or ollama run <model> for interactive testing.
- Serve a web app: Run Ollama as a daemon (systemd service) for always-on inference.
- Customization: Use the OLLAMA_NUM_PARALLEL environment variable to limit concurrent requests (e.g., 1-2 for A1 Flex's 4 cores).
Build a Python chatbot
Now that we know how to connect to both Ampere optimized Ollama and Docker Model Runner, we can build some interactive functionality into our previous Python script so we can pick a model to chat with. We will take advantage of the OpenAI API compatibility of both services to streamline and reuse code.
When you run the following script, it will:
- Query BOTH Docker Model Runner and Ampere optimized Ollama for their available models using OpenAI's client.models.list(). This calls each backend's GET /v1/models endpoint, which returns the list of loaded models in OpenAI-style JSON.
- Display each model ID with a number and the backend it comes from.
- Let you choose one, and use that model for the chat request.
- Prompt you for your message, which is inserted into the "role": "user" message.
- Send the request and print the reply.
- Let you pick a different model without restarting.
from openai import OpenAI

# --- Config ---
# Adjust MODEL_RUNNER_BASE depending on where you run it:
# From another container: use `host.docker.internal`
# On host: use `localhost`
MODEL_RUNNER_BASE = "http://localhost:12434/engines/llama.cpp/v1"
OLLAMA_OPENAI_BASE = "http://localhost:11434/v1"
API_KEY = "not-needed"

# 1. Create OpenAI clients for both backends
mr_client = OpenAI(base_url=MODEL_RUNNER_BASE, api_key=API_KEY)
ollama_client = OpenAI(base_url=OLLAMA_OPENAI_BASE, api_key=API_KEY)

def get_models(client, source_name):
    try:
        resp = client.models.list()
        return [{"id": m.id, "source": source_name, "client": client} for m in resp.data]
    except Exception as e:
        print(f"Error fetching models from {source_name}: {e}")
        return []

# 2. Fetch available models from both sources
models = get_models(mr_client, "model_runner") + get_models(ollama_client, "ollama")
if not models:
    print("No models found. Make sure Model Runner and/or Ollama are running in OpenAI mode.")
    exit(1)

while True:
    # 3. Show combined list and let user choose
    print("\nAvailable models:")
    for idx, m in enumerate(models, start=1):
        print(f"{idx}. {m['id']} ({m['source']})")
    choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
    if choice.lower() == 'q':
        break
    try:
        model_idx = int(choice) - 1
        if model_idx < 0 or model_idx >= len(models):
            raise ValueError
    except ValueError:
        print("Invalid selection.")
        continue
    selected = models[model_idx]
    print(f"Using model: {selected['id']} from {selected['source']}")
    while True:
        # 4. Ask for user prompt
        user_prompt = input("Enter your prompt (or 'back' to choose another model): ").strip()
        if user_prompt.lower() == 'back':
            break
        if not user_prompt:
            continue
        try:
            # 5. Send the request
            resp = selected["client"].chat.completions.create(
                model=selected["id"],
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": user_prompt}
                ]
            )
            # 6. Print the model's reply
            print("\nModel reply:\n")
            print(resp.choices[0].message.content)
        except Exception as e:
            print(f"Error querying {selected['source']}: {e}")
We can run it with python chat.py. Now you can:
- Pick model 1, send multiple prompts
- Type back to return to model selection
- Pick another model without restarting the container
My result looks like this
Available models:
1. ai/qwen2.5:3B-Q4_K_M (model_runner)
2. ai/llama3.2:3B-Q4_0 (model_runner)
3. qwen2.5:3b (ollama)
4. hf.co/AmpereComputing/llama-3.2-3b-instruct-gguf:Llama-3.2-3B-Instruct-Q8R16.gguf (ollama)
Select a model [1-4] or 'q' to quit: 1
Using model: ai/qwen2.5:3B-Q4_K_M from model_runner
Enter your prompt (or 'back' to choose another model): write a haiku for local LLM
Model reply:
Local LLM whispers,
Infinite knowledge flows through,
Wordsmith of words.
Enter your prompt (or 'back' to choose another model):
Improve multi-model serving in Model Runner
Even though client.models.list() will show you multiple models, Docker Model Runner doesn’t truly run them all at once. It loads one into memory at a time, and if you call a different one, it needs to perform these tasks sequentially:
- Unload the current model from the llama.cpp backend
- Load the new one from disk
- Initialize it in memory
Depending on model size and hardware, this can take a long time or appear to hang.
From Docker’s own issue tracker, true concurrent multi‑model serving isn’t fully supported yet. Right now, switching models mid‑session can cause long cold‑start delays or even timeouts if the backend doesn’t handle the swap cleanly.
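You can see the swap cost for yourself by timing the first request to each model. A rough sketch, assuming the two models we pulled earlier and the Model Runner endpoint from above:
import time
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/llama.cpp/v1",
    api_key="not-needed"
)

def time_first_reply(model_id):
    # The first call after a swap includes unload + load + init;
    # subsequent calls to the same model are much faster.
    start = time.perf_counter()
    client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Say hi."}],
        max_tokens=5,
    )
    return time.perf_counter() - start

for model_id in ["ai/qwen2.5:3B-Q4_K_M", "ai/llama3.2:3B-Q4_0"]:
    print(f"{model_id}: first reply in {time_first_reply(model_id):.1f}s")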
To avoid the delay, we have a few options:
- Restart Model Runner with the desired model: If you only need one model at a time, stop the container and start it again with the new model before running your client.
- Run each model in its own Model Runner container (multi-container mapping): In this approach, we don't swap models inside a single Model Runner container. Instead, we run each model in its own dedicated Model Runner container on a different port, and have the Python client map each model name to its own base_url.
For example, run these on different ports:
docker run -d \
  --name llama3.2 \
  -p 12434:12434 \
  docker/model-runner:latest ai/llama3.2:3B-Q4_0
docker run -d \
  --name qwen3b \
  -p 12435:12434 \
  docker/model-runner:latest ai/qwen2.5:3B-Q4_K_M
Now each model is already loaded in its own container. In the Python client below, we hard-code a mapping of model IDs to their container base URLs. When we pick a model, the client automatically points to the correct container/port. You can switch instantly between models without restarting anything or incurring the cold-start delay. That'll be the smoothest experience until Docker ships true multi-model support.
from openai import OpenAI

API_KEY = "not-needed"  # Model Runner doesn't validate the key

# Map model IDs to their dedicated container base URLs
MODEL_ENDPOINTS = {
    "ai/llama3.2:3B-Q4_0": "http://localhost:12434/engines/llama.cpp/v1",
    "ai/qwen2.5:3B-Q4_K_M": "http://localhost:12435/engines/llama.cpp/v1"
}
models = list(MODEL_ENDPOINTS.keys())

while True:
    # Show models and let user choose
    print("\nAvailable models:")
    for idx, model_id in enumerate(models, start=1):
        print(f"{idx}. {model_id}")
    choice = input(f"Select a model [1-{len(models)}] or 'q' to quit: ").strip()
    if choice.lower() == 'q':
        break
    try:
        model_idx = int(choice) - 1
        if model_idx < 0 or model_idx >= len(models):
            raise ValueError
    except ValueError:
        print("Invalid selection.")
        continue
    selected_model = models[model_idx]
    base_url = MODEL_ENDPOINTS[selected_model]
    client = OpenAI(base_url=base_url, api_key=API_KEY)
    print(f"Using model: {selected_model}")
Monitor resource usage
When we run our Python client in one terminal, we can monitor CPU/RAM usage from another terminal using top or htop.
If we are using Docker Model Runner to manage models, or the Ampere optimized Ollama container, we can also watch their usage with docker stats in another terminal, or filter by container name, e.g. docker stats qwen3b.
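If you want to capture a snapshot from a script instead (say, alongside the timing sketch above), here is a minimal Python sketch that shells out to docker stats --no-stream:
import subprocess

# One-shot snapshot of CPU and memory usage for all running containers
out = subprocess.run(
    ["docker", "stats", "--no-stream", "--format",
     "{{.Name}}: CPU {{.CPUPerc}}, MEM {{.MemUsage}}"],
    capture_output=True, text=True, check=True
)
print(out.stdout)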