Local LLM inference in OCI Ampere A1
In this series, I’m going to walk you through running your own large language models locally in a CPU-only setup. Specifically, you will use Oracle Cloud Infrastructure’s generous free tier by provisioning an Arm-based Ampere A1 instance and running the Ampere-optimized llama.cpp container as your inference engine. This container’s optimized quantization scheme both reduces model size and speeds up inference.
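To give you a feel for the end state before we dig in: once the container is serving a model, you can talk to it over llama.cpp’s OpenAI-compatible HTTP API from any client. Here’s a minimal Python sketch; the host, port, and model name are placeholder assumptions for illustration, not values fixed by this series.

```python
import requests

# Assumes a llama.cpp server (e.g. the Ampere-optimized container) is already
# listening locally -- adjust host/port to match your own setup.
LLAMA_SERVER = "http://localhost:8080"

payload = {
    # Model name is illustrative; the server answers with whichever
    # GGUF model it was started with.
    "model": "llama-3.1-8b-q4",
    "messages": [
        {"role": "user", "content": "Explain quantization in one sentence."}
    ],
    "max_tokens": 128,
}

# llama.cpp's server exposes an OpenAI-compatible chat completions endpoint.
resp = requests.post(f"{LLAMA_SERVER}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the same snippet works unchanged against the other inference engines we’ll compare later in the series, as long as they expose the same endpoint.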
We’ll also compare different inference engines on the Ampere platform and look at options to optimize our llama.cpp server.
Let’s go!
Posts in this series:

- Benchmarking CPU-only LLM Inference: Prompt Variation (AI, LLM, evaluation)
- Benchmark local LLM inference engines in Oracle Ampere (AI, LLM, evaluation)
- Serving Plamo-2-Translate LLM for Japanese-English Translation on Oracle Ampere VM (AI, LLM, Japanese language, NLP)
- Convert and quantize LLM models with Ampere optimized llama.cpp container (AI, LLM, Oracle Cloud)
- How to run llama.cpp on Arm-based Ampere with Oracle Linux (LLM, Oracle Cloud)
- Serve and inference with local LLMs via Ollama & Docker Model Runner in Oracle Ampere (AI, LLM, software development, Oracle Cloud)
- Running LLMs locally on Ampere A1 Linux VM: Comparing options (AI, LLM, evaluation, Oracle Cloud)