Local LLM inference in OCI Ampere A1
In this series, I’m going to walk you through running your own large language models locally in a CPU-only setup. Specifically, you will use Oracle Cloud Infrastructure’s generous free tier by provisioning an Arm-based Ampere A1 instance and running the Ampere-optimized llama.cpp container as your inference engine. This container’s optimized quantization scheme both reduces model size and speeds up inference.
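To give you a feel for the end state before we dig in: once the container is serving a model, you can talk to it over llama.cpp’s OpenAI-compatible HTTP API from any client. Here’s a minimal Python sketch; the host, port, and model name are placeholder assumptions for illustration, not values fixed by this series.

```python
import requests

# Assumes a llama.cpp server (e.g. the Ampere-optimized container) is already
# listening locally -- adjust host/port to match your own setup.
LLAMA_SERVER = "http://localhost:8080"

payload = {
    # Model name is illustrative; the server answers with whichever
    # GGUF model it was started with.
    "model": "llama-3.1-8b-q4",
    "messages": [
        {"role": "user", "content": "Explain quantization in one sentence."}
    ],
    "max_tokens": 128,
}

# llama.cpp's server exposes an OpenAI-compatible chat completions endpoint.
resp = requests.post(f"{LLAMA_SERVER}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the same snippet works unchanged against the other inference engines we’ll compare later in the series, as long as they expose the same endpoint.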
We’ll also compare different inference engines on the Ampere platform and look at options to optimize our llama.cpp server.
Let’s go!
Posts in this series:

- Benchmarking CPU-only LLM Inference: Prompt Variation (AI, LLM, evaluation)
- Benchmark local LLM inference engines in Oracle Ampere (AI, LLM, evaluation)
- Serving Plamo-2-Translate LLM for Japanese-English Translation on Oracle Ampere VM (AI, LLM, Japanese language, NLP)
- Convert and quantize LLM models with Ampere optimized llama.cpp container (AI, LLM, Oracle Cloud)
- How to run llama.cpp on Arm-based Ampere with Oracle Linux (LLM, Oracle Cloud)
- Serve and inference with local LLMs via Ollama & Docker Model Runner in Oracle Ampere (AI, LLM, software development, Oracle Cloud)
- Running LLMs locally on Ampere A1 Linux VM: Comparing options (AI, LLM, evaluation, Oracle Cloud)