
Local LLM inference in OCI Ampere A1

Published

November 21, 2025

In this series, I'm going to walk you through running your own large language models locally in a CPU-only setup. Specifically, you will use Oracle Cloud Infrastructure's generous free-tier offering by provisioning an Arm-based Ampere A1 instance and running the Ampere-optimized llama.cpp container as your inference engine. This container lets you leverage an optimized quantization scheme to both reduce model size and speed up inference.
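As a rough sketch of that workflow on the A1 instance, the commands below pull the container and start a llama.cpp server against a quantized GGUF model. The image name, tag, model path, and thread count are assumptions for illustration; check the Ampere Computing registry and the posts in this series for the exact values.

```shell
# Pull the Ampere-optimized llama.cpp image
# (image name/tag assumed; verify against Ampere's registry)
docker pull amperecomputingai/llama.cpp:latest

# Serve a quantized GGUF model over HTTP on port 8080.
# /models is a host directory holding your .gguf files;
# -t sets the thread count (e.g. one per A1 OCPU).
docker run --rm -p 8080:8080 \
  -v "$HOME/models:/models" \
  amperecomputingai/llama.cpp:latest \
  llama-server -m /models/model.gguf \
    --host 0.0.0.0 --port 8080 -t 4
```

Once the server is up, any OpenAI-compatible client can point at `http://<instance-ip>:8080`.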

We’ll also compare different inference engines on the Ampere platform and look at options to optimize our llama.cpp server.

Let’s go!

Benchmarking CPU-only LLM Inference with Optimization: llama-server flags
AI, LLM, evaluation · Nov 22, 2025 · 10 min

Benchmarking CPU-only LLM Inference with Optimization: Caching and Batching
AI, LLM, evaluation · Nov 21, 2025 · 21 min

Benchmarking CPU-only LLM Inference: Prompt Variation
AI, LLM, evaluation · Nov 16, 2025 · 18 min

Benchmark local LLM inference engines in Oracle Ampere
AI, LLM, evaluation · Nov 12, 2025 · 23 min

Serving Plamo-2-Translate LLM for Japanese-English Translation on Oracle Ampere VM
AI, LLM, Japanese language, NLP · Nov 1, 2025 · 17 min

Convert and quantize LLM models with Ampere optimized llama.cpp container
AI, LLM, Oracle Cloud · Oct 30, 2025 · 13 min

How to run llama.cpp on Arm-based Ampere with Oracle Linux
LLM, Oracle Cloud · Sep 21, 2025 · 21 min

Serve and inference with local LLMs via Ollama & Docker Model Runner in Oracle Ampere
AI, LLM, software development, Oracle Cloud · Sep 14, 2025 · 17 min

Running LLMs locally on Ampere A1 Linux VM: Comparing options
AI, LLM, evaluation, Oracle Cloud · Sep 13, 2025 · 37 min

Copyright 2023-2025, Tiffena Kou