In this post, we’ll explore Plamo-2-translate, an 8B-parameter LLM fine-tuned specifically for Japanese-English translation. Despite its modest size, it delivers quality on par with much larger 200B models. This makes it ideal for local deployment, preserving privacy and slashing costs.
We’ll start by reviewing why traditional machine translation (MT) models often underperform compared to general-purpose LLMs. Then, we’ll dive into Plamo-2-translate itself, including a hands-on setup: serving it on an Oracle Cloud Infrastructure (OCI) Ampere VM using optimized quantization for efficient inference. Finally, we’ll review Plamo’s unique architecture and training innovations.
Traditional MT Models vs LLMs
The shift from specialized MT systems to versatile LLMs marks a pivotal evolution in AI. Models like opus-mt-ja-en and m2m100_418M, while efficient, frequently trail behind heavyweights like GPT, Gemini, and Grok. Let’s break down the key reasons.
Specialization vs. Generality
Traditional MT models like opus-mt-ja-en and m2m100_418M are encoder-decoder architectures laser-focused on translation. They ingest a source sentence and output a target one that’s grammatically sound and semantically close.
- Training data: Built on massive parallel corpora (paired sentences across languages), honing efficiency for translation but restricting scope.
- Knowledge scope: Limited to linguistic patterns; no grasp of broader domains like law, medicine, or casual speech.
By contrast, decoder-only LLMs like GPT-5, Gemini, Grok, and DeepSeek are generalists:
- Training data: Trained from petabytes of internet-scale text—books, articles, code, chats, building up broad “world knowledge.”
- Core mechanism: Focus on next-token prediction, enabling creative, context-rich generation across languages.
For translation, these LLMs don’t look up mappings; they generate fluent sentences probabilistically, drawing on holistic understanding.
World knowledge and context
A stark gulf in world knowledge explains much of the performance difference.
- Contextual depth: MT models process sentence-by-sentence, stumbling on Japanese’s context-heavy nuances (e.g., implied meanings). LLMs leverage vast context windows to track terminology, idioms, and discourse across documents.
- Beyond literalism: Traditional models churn out word-for-word results that sound off culturally or stylistically. LLMs, steeped in diverse text, prioritize meaning over mechanics.
- Natural fluency: Exposure to human writing equips LLMs to handle slang, idioms, and tone, areas where MT falters into awkwardness.
Decoding MT failures
You’ve likely seen these quirks from traditional MT models:
- Literal nonsense or random words: Out-of-domain inputs trigger fallback to rote patterns, yielding gibberish or hallucinations from sparse training data.
- Echoing source text: High uncertainty prompts a “safe” copy-paste, giving up when probabilities don’t align.
- Ambiguity traps: Context-dependent words (e.g., technical vs. casual) baffle MT without external cues, while LLMs can infer from surrounding text.
- Repetition: Autoregressive generation builds text token-by-token (words or subwords like “ing”). Probabilities guide each step, but greedy decoding, which always picks the top token, can loop endlessly. For example, the MT model scores its vocabulary (e.g., “the,” “oh,” “!”) based on prior context and selects the max-probability token as the next one, so a common token (e.g., “oh”) snowballs. Given the input おい、てめぇなにやってんだ!コラァ!ちっ、オラぁ一番強ぇんだぜ コラァ!, the interjection コラァ! maps to “oh!”, then repetition reinforces itself, yielding the output: “Oh, what do you do! oh, oh, oh, oh…”
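The repetition failure can be reproduced with a toy sketch of greedy decoding. Everything here is hypothetical (a hard-coded two-state “model,” not a real MT system); it only illustrates how always taking the argmax token can lock generation into a loop:

```python
# Toy next-token distribution (hypothetical, not a real MT model): once "oh"
# has been emitted, it dominates the probabilities even more strongly.
def next_token_probs(context):
    if context and context[-1] == "oh":
        return {"oh": 0.6, "!": 0.3, "the": 0.1}
    return {"oh": 0.5, "what": 0.3, "!": 0.2}

def greedy_decode(prompt, steps=6):
    tokens = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(tokens)
        tokens.append(max(probs, key=probs.get))  # always pick the top token
    return tokens

print(greedy_decode(["Oh", ","]))
# → ['Oh', ',', 'oh', 'oh', 'oh', 'oh', 'oh', 'oh']
```

Sampling strategies such as top-p or a repetition penalty break this loop by occasionally choosing a lower-probability token.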
These are all symptoms of narrow training and naive decoding. LLMs break free with richer context and smarter strategies. In essence, while MT shines in silos, LLMs excel as multipurpose processors, yielding translations that feel human: accurate, contextual, and idiomatic.
Introducing Plamo-2-Translate
Plamo-2-translate builds on the PLaMo 2.1 8B base, fine-tuned for translation. The base PLaMo 2’s advances stem from better architecture (hybrid Mamba SSM + expanded attention), loss diagnostics, data curation (inspired by FineWeb/DataComp-LM), and self-tuning. These tackle Japanese-specific hurdles like long-context recall and bilingual fidelity, outperforming its predecessors on benchmarks at a smaller size.
It handles diverse domains (e.g., travel brochures, lyrics, tech docs, PR) with contextual finesse, thanks to high English/Japanese ratios in training.
Key benefits:
- Compact power: Matches 200B-model quality in an 8B footprint, runnable on consumer hardware.
- Specialized: Built specifically for Japanese-English translation, not as a general-purpose chatbot.
- Privacy-first: Local runs avoid cloud risks for sensitive docs.
- Efficiency: Low resource draw for on-premise setups.
You can try the online demo to see for yourself, and the model is available on Hugging Face.
Goal: Self-host on OCI Ampere VM
After being impressed by the demo’s ability to handle everything from travel brochures to technical docs, I wanted to host it myself. Since there’s no major inference provider serving it yet, my trusty Oracle Cloud Infrastructure (OCI) Ampere VM was the perfect candidate.
Alternative: On Apple Silicon Macs, try the Plamo CLI via MLX for quick local runs.
I skipped pre-quantized GGUF files from mmnga/plamo-2-translate-gguf to leverage Ampere’s llama.cpp optimizations. This meant I’d be converting and quantizing the model myself.
Hands-On: Deploying and Using Plamo-2-Translate
Here’s the practical guide to getting PLaMo-2-Translate working. The most important part to understand is how to ask the model for a translation. It’s not instruction-tuned; it’s a raw pattern-completion model that excels at structured translation prompts but flops on chatty inputs.
Key challenge: not a chatbot!
First, you have to know what kind of model this is. The model card clearly states it has NOT been instruction-tuned.
This is a critical distinction:
- An instruction-tuned model (like ChatGPT) is trained to understand commands like “Translate this for me:” and hold a conversation.
- A base or task-fine-tuned model (like this one) is a raw “next-word predictor.” It’s trained to complete text based on patterns it has learned.
This model wasn’t trained on chats; it was fine-tuned on specific, structured text pairs. To get a translation, we must perfectly mimic that structure. If we don’t, the model will just “complete” our prompt in the wrong way (e.g., by writing more Japanese text).
Step 1: Crafting prompts
You can peek at the model’s metadata to find its preferred template. For PLaMo-2-Translate, the prompt needs to be manually constructed with special tokens and input/output blocks.
Let’s test this with llama-cli, using a prequantized Q4_K_M model first.
Example 1: The WRONG Way
If we just pass the text, the model gets confused.
```
# We use -no-cnv to disable chat mode and send the raw prompt
./build/bin/llama-cli \
    -m ./models/plamo-2-translate-Q4_K_M.gguf \
    --prompt "input lang=Japanese|English\n昨年勝利した石破茂首相の支持者の動向も鍵を握る。" \
    -no-cnv
```

Output (Incorrect):

```
input lang=Japanese|English
昨年勝利した石破茂首相の支持者の動向も鍵を握る。安倍晋三元首相の支持者の一部は、石破氏が首相に就任した後、岸田氏に投票する意向を示している。
[end of text]
```

See? It just continued the prompt with more Japanese text. It’s doing “completion,” not “translation.”
Example 2: The RIGHT Way
Now, let’s use the full template. Notice the special <|plamo:op|> tokens and the clear output tag at the end, which acts as the trigger.
```
./build/bin/llama-cli \
    -m ./models/plamo-2-translate-Q6_K.gguf \
    -p "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=Japanese|English\n昨年勝利した石破茂首相の支持者の動向も鍵を握る。\n<|plamo:op|>output\n" \
    -no-cnv
```

Output (Correct!):

```
dataset
translation
input lang=Japanese|English
昨年勝利した石破茂首相の支持者の動向も鍵を握る。
output
The stance of supporters of Prime Minister Shigeru Ishiba, who won last year’s election, will also be a crucial factor.
[end of text]
```

Success! That’s a high-quality translation. This strict reliance on templates is the key to using task-specific models.
Besides `<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=Japanese|English\nYour Japanese text here.\n<|plamo:op|>output\n`, we can also infer another prompt template from the vLLM model card: `<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=Japanese\nYour Japanese text here.\n<|plamo:op|>output lang=English\n`
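Escaping these tokens by hand is error-prone, so a tiny helper can assemble the template. The function name is mine (hypothetical); the token layout is the one shown above:

```python
# Assemble the PLaMo-2-translate completion prompt (template as shown above).
# build_plamo_prompt is a hypothetical helper name, not part of any library.
def build_plamo_prompt(text, src="Japanese", tgt="English"):
    return (
        "<|plamo:op|>dataset\n"
        "translation\n"
        f"<|plamo:op|>input lang={src}|{tgt}\n"
        f"{text}\n"
        "<|plamo:op|>output\n"
    )

print(build_plamo_prompt("昨年勝利した石破茂首相の支持者の動向も鍵を握る。"))
```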
Step 2: Serving with llama-server
Now that we know the prompt, let’s serve it as an API endpoint. We’ll start the server:
```
./build/bin/llama-server -m ./models/plamo-2-translate-Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8081
```

Since this isn’t a chat model, we’ll use the /completion endpoint, not /v1/chat/completions.
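The endpoint is also easy to call from Python’s standard library. This is a sketch: build_payload and translate are hypothetical helper names, the sampling values mirror the curl request, and the URL assumes the server above on localhost:8081.

```python
import json
import urllib.request

# Build the JSON body for llama-server's /completion endpoint
# (prompt template and sampling parameters as in the curl example).
def build_payload(text):
    prompt = (
        "<|plamo:op|>dataset\ntranslation\n"
        f"<|plamo:op|>input lang=Japanese|English\n{text}\n"
        "<|plamo:op|>output\n"
    )
    return {"prompt": prompt, "n_predict": -1,
            "temperature": 0.8, "top_p": 0.95}

# POST the payload and return the generated translation text.
def translate(text, url="http://localhost:8081/completion"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"].strip()
```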
Let’s issue a curl command to send a translation request for a travel brochure snippet:
```
curl http://localhost:8081/completion \
    -H "Content-Type: application/json" \
    -d '{
        "prompt": "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=Japanese|English\nあっという間に過ぎ去る岩手の秋 ただ、その一瞬の季節にこそ 岩手の魅力が詰まってます。 この季節にしか味わえない岩手を あるいてみませんか? 秋は短し旅せよ岩手いわて 秋季観光キャンペーンが始まります。\n<|plamo:op|>output\n",
        "n_predict": -1,
        "temperature": 0.8,
        "top_p": 0.95
    }'
```

Performance on Ampere
We’ll look at results from a regular llama-server with Q4_K_M, then an Ampere container running both the regular Q4_K_M quantized model file and self-built, Ampere-optimized quants.
llama-server with Q4_K_M:
- Translation: “Iwate’s autumn passes by in the blink of an eye—yet precisely in this fleeting season lies Iwate’s most captivating charm. Why not explore Iwate’s unique autumn offerings? Autumn may be brief, but it’s time to journey through Iwate! Autumn Tourism Campaign now underway.”
- Speed: 4.32 tokens/second (eval)
Ampere Container (Q4_K_M):
- Translation: “Iwate’s autumn slips by in an instant - but precisely in those fleeting moments lies all of Iwate’s appeal. Why not explore what makes Iwate special during this unique season? Autumn may be brief, but if you hurry, Iwate awaits. The ‘Short but Perfect Autumn: Travel Iwate’ autumn tourism campaign is now underway.”
- Speed: 4.93 tokens/second (eval)
Self-Built Ampere-Optimized (Q4_K_4):
- Translation: “Iwate’s autumn slips away so quickly - but in those fleeting moments, it holds all of Iwate’s charm. Why not explore what makes Iwate special only during this brief season? Autumn is short - so let’s travel and experience Iwate now! Iwate Autumn Tourism Campaign begins.”
- Speed: 6.20 tokens/second (eval)
Self-Built Ampere-Optimized (Q8R16):
- Translation: “The fleeting autumn in Iwate passes in an instant, yet all of Iwate’s charm is condensed into this short season. Why don’t you visit and experience Iwate during this unique season? "Autumn is short, travel Iwate" tourism campaign begins.”
- Speed: 4.84 tokens/second (eval)
Comparison
- llama-server Q4_K_M: “Iwate’s autumn passes by in the blink of an eye—yet precisely in this fleeting season lies Iwate’s most captivating charm. …” (Total: 24s / 120 tokens)
- Ampere Container Q4_K_M: Similar output from the same model, faster prompt eval (9s / 62 tokens).
- Ampere-Optimized Quant: Q4_K_4: Crisp output; blazing-fast prompt eval (3s / 62 tokens), total 13s / 124 tokens.
- Ampere-Optimized Quant: Q8R16: More lyrical: “The fleeting autumn in Iwate passes in an instant…”; prompt 2.3s / 62 tokens, total 15s / 116 tokens.
Interactive mode (chat-like)
For multi-turn:
```
./build/bin/llama-cli -m ./models/plamo-2-translate-Q6_K.gguf \
    --jinja --chat-template-file plamo-chat-template.jinja2 --interactive
```

- Auto-wraps inputs with tags (e.g., <|plamo:op|>input lang=Japanese|English\n + text + <|plamo:op|>output\n).
- Add --system-prompt "You are a translator." for context.
- Great for iterative tweaks as it preserves history.
Tip: Skip Jinja for one-off jobs, as manual prompts ensure precision. Templates shine in interactive, REPL-style sessions but can cause problems with curl requests.
Deep dive: PLaMo 2’s innovations
For those who want to know why PLaMo 2 (the base model) performs so well, its architecture is a fascinating case study in solving modern LLM challenges. PLaMo 2 focused on four key areas:
- New architecture: Moving beyond pure Transformers with a hybrid Mamba + Attention model.
- Training stability: Diagnosing and managing loss spikes.
- Data curation: A sophisticated data prep pipeline.
- Optimization: Using Self-Tuning Networks (STNs).
1. Architectural shifts
Transformer limits
As LLM use cases began requiring tens of thousands of tokens, the original PLaMo-100B, built on a standard Transformer, began to suffer from its core limitations:
- Quadratic complexity (O(n^2)): As the context window (sequence length) grows, memory and computation required by the self-attention mechanism explode. This makes handling 32k+ token contexts incredibly expensive and slow.
- Inference overhead: The need to cache previous key-value states leads to vastly increased memory consumption, severely limiting throughput and practical deployment.
- Extrapolation troubles: Poor generalization for sequence lengths far greater than those seen during training.
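To make the inference-overhead point concrete, here is back-of-the-envelope arithmetic for the KV cache of a full-attention model. The config (32 layers, 32 heads, head_dim 128, fp16) is an assumed 8B-class shape, not PLaMo’s actual one:

```python
# KV-cache size for a standard Transformer: one key and one value vector
# per layer per token. Memory grows linearly with context length, while
# attention compute grows as O(n^2).
def kv_cache_bytes(seq_len, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * heads * head_dim * dtype_bytes * seq_len  # 2 = K and V

GIB = 1024 ** 3
for n in (2_048, 32_768):
    print(f"{n:>6} tokens: {kv_cache_bytes(n) / GIB:.2f} GiB")
# →   2048 tokens: 1.00 GiB
# →  32768 tokens: 16.00 GiB
```

At 32k tokens the cache alone rivals the quantized weights in size, which is exactly the pressure that motivates the hybrid design below.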
The Fix: A Hybrid Mamba + Expanded Attention
Per the PLaMo team’s technical blog, they first tried a hybrid architecture using Samba, which combines:
- Mamba (SSM): A Selective State Space Model with linear complexity (O(n)), great for long sequences.
- Sliding Window Attention (SWA): Standard Transformer attention restricted to a small, local window (e.g., 2,048 tokens), trading global reach for efficiency and short-range precision.
However, this hybrid failed “needle-in-a-haystack” tests. It couldn’t find distant info (the “needle”) if it was outside its SWA window. The SSM’s compressed state wasn’t high-fidelity enough for perfect recall.
By expanding the SWA window to the full 32k context before continual pre-training (CPT), the model regains the global context of a Transformer, solving the retrieval problem. This was a deliberate tradeoff: sacrifice some efficiency during CPT to gain massive long-context performance in the final model. The model is shown to excel at long-context retrieval (e.g., passkey tasks) and boosts LongBench/pfgen-bench scores.
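A sliding-window mask makes the failure easy to see. The sketch below uses toy sizes (sequence length 6, window 3); with a real 2,048-token window, any “needle” further back than the window is simply invisible to attention:

```python
# Causal sliding-window attention mask: token i may attend to position j
# only if j is within the last `window` positions (and not in the future).
def swa_mask(seq_len, window):
    return [[1 if 0 <= i - j < window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in swa_mask(6, 3):
    print(row)
# The last token (row 5) sees positions 3-5 but not position 0:
# anything older than the window has fallen out of reach.
```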
To make this work, they also adjusted the Rotary Position Embeddings (RoPE) by setting rope_theta to a very high value (1,000,000), a technique also used in models like Gemma-3 to handle long-context positional data.
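The effect of raising rope_theta can be sketched numerically. In RoPE, each pair of head dimensions rotates at frequency theta^(-2i/d); raising the base from the common 10,000 to 1,000,000 slows the low-frequency channels so distant positions stay distinguishable across 32k tokens. The dimensions below are illustrative, not PLaMo’s exact config:

```python
# Per-channel RoPE rotation frequencies: theta ** (-2 * i / dim)
# for each of the dim // 2 rotary pairs.
def rope_freqs(dim, theta):
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

slow_10k = rope_freqs(128, 10_000)[-1]    # slowest channel, base 1e4
slow_1m = rope_freqs(128, 1_000_000)[-1]  # slowest channel, base 1e6
print(slow_10k, slow_1m)  # the larger base yields a far lower frequency
```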
Architecture summary
| Feature | PLaMo 2 Approach | Standard Transformer |
|---|---|---|
| Core Module | Hybrid: Mamba SSM + Sliding-Window Attention | Full Self-Attention |
| Attention Window (pre-CPT) | Localized (2K) | Global |
| Attention Window (CPT+) | Full (32K) | Global (memory-capped) |
| Positional Encoding | Adjustable Base Frequency RoPE (high theta) | Standard RoPE/ALiBi |
| Memory (pre-CPT) | Linear (SSM + windowed) | (O(n^2)) |
| Memory (CPT+) | (O(n^2)) in full window | (O(n^2)) |
| Long Context Recall | High (after CPT) | Limited (by memory/cost) |
| Language Support | Bilingual (JP/EN), Code, Math | Often EN-centric |
2. Taming loss spikes
During training, the team saw “spikes” in the loss curve, especially when reusing weights (initializing a new model from a smaller, already-trained one). This is a known phenomenon, sometimes related to the “Lower-Loss-as-Sharper” (LLAS) structure.
Cause of LLAS
As the model gets better (lower loss), its “loss landscape” can become a “sharp, narrow valley.”
- In this sharp valley, the model is unstable. A small data or gradient change can “bounce” it out, causing the loss to spike.
- After the spike, the optimizer usually finds a “flatter, wider valley” (a more stable, generalizable solution) and the loss goes down again.
This often happens because of:
- Reuse mismatch: Layer misalignments demand adjustment.
- Learning rate or data shifts: Abrupt schedules or outlier batches.
- Numerics: FP16/BF16 overflows.
Mitigations
- Gradient clipping.
- Monitoring/early-stop.
- Phased LR ramps.
- Clean batches.
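The first mitigation can be sketched in a few lines. This is generic clipping by global norm, not PFN’s exact implementation:

```python
import math

# Gradient clipping by global norm: if the combined gradient norm exceeds
# max_norm, scale every gradient down proportionally, bounding the step
# size a single outlier batch can cause.
def clip_by_global_norm(grads, max_norm):
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

spiky = [3.0, 4.0]                           # global norm 5.0
print(clip_by_global_norm(spiky, 1.0))       # rescaled to norm 1.0
print(clip_by_global_norm([0.1, 0.2], 1.0))  # under the cap: unchanged
```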
In PLaMo 2, tolerated spikes were followed by faster convergence, suggesting that controlled spikes can be part of a healthy training process.
3. Curated data pipeline
PLaMo 2’s “deep curation” mixes volume with quality, prioritizing JP/EN balance (high-ratio datasets) for translation fidelity.
Composition & Filtering
- Sources: Filtered CommonCrawl (deduplicated via MinHash/LSH); Japanese data from the PLaMo-100B project; open English datasets.
- Filtering: Used “educational value” filters, inspired by best practices from FineWeb and DataComp-LM.
- Synthetic Data: Used LLMs to generate new data via paraphrasing and translation, similar to the Magicoder approach for code.
- Sampling: Category-based downsampling to remove biases.
- Quality Control: Used an “LLM-as-a-Judge” to score and filter the quality of the synthetic data.
- Code/Math: Annotated expansions from GSM8k/Lila seeds.
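The MinHash step mentioned under Sources can be illustrated with a toy version. This sketch (hypothetical helper names, with md5 standing in for a proper hash family) estimates Jaccard similarity between shingle sets, which is how near-duplicate pages are flagged at scale:

```python
import hashlib

# Character 3-gram shingles of a document.
def shingles(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# MinHash signature: for each seeded hash function, keep the minimum hash
# value over the shingle set.
def minhash(shingle_set, num_hashes=64):
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

# Fraction of matching signature slots estimates Jaccard similarity.
def similarity(a, b):
    sa, sb = minhash(shingles(a)), minhash(shingles(b))
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(similarity("the quick brown fox", "the quick brown fox jumps"))  # high
print(similarity("the quick brown fox", "completely different text"))  # low
```

In production pipelines, LSH banding over these signatures avoids comparing every pair of documents.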
Data splits
The 1B model was trained on 4 trillion tokens with this approximate mix:
| Dataset Type | Phase 1 (%) | Phase 2 (%) | Total Tokens |
|---|---|---|---|
| English | 45 | 35 | 1.75T |
| Japanese | 30 | 40 | 1.25T |
| Coding | 15 | 15 | 0.6T |
| Other | 10 | 10 | 0.4T |
This mix goes beyond C4/FineWeb/RedPajama by integrating bilingual and synthetic data.
4. Self-Tuning: adaptive optimization
PLaMo 2 also leverages Self-Tuning Networks (STNs). Instead of setting hyperparameters (like dropout rate) once and leaving them, STNs adjust these parameters during the training process.
This is applied during Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). For example, during DPO (the process of teaching the model to prefer “good” answers over “bad” ones), a self-tuning approach can dynamically adjust the regularization or even the SFT loss based on metrics, leading to a more stable and effective alignment.
This technique helps the model converge faster and reach a more robust state, improving performance on instruction-following and Japanese tasks. Similar adaptive-optimization ideas appear in the training pipelines of other modern LLMs.
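For reference, the standard DPO objective that such self-tuning adjusts can be computed in a few lines. Inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; beta (one knob a self-tuning scheme might adapt) controls how far the policy may drift from the reference. The values below are illustrative:

```python
import math

# Standard DPO loss for one preference pair:
# -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The policy favors the chosen answer more than the reference does,
# so the loss falls below log(2), the value at a zero margin.
print(dpo_loss(-10.0, -14.0, -12.0, -13.0))
```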
Comparison to Other Modern Training Methods
| Technique | Description | Self-Tuning in PLaMo 2 | Notes |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Mimic annotated outputs | Yes; combined with self-tuning rules | Industry norm; more effective when combined with self-tuning of augmentation or filtering |
| DPO (Direct Preference Opt.) | Optimize based on pairwise preferences between responses | Yes; tuning loss and regularization | State-of-the-art for alignment beyond RLHF; self-tuning loss key for peak performance |
| RLHF (Reinforce. Learn w/ HF) | Human Feedback-driven RL | Not primary in PLaMo 2 pipeline | Unstable; requires sophisticated reward modeling; DPO preferred |
| PEFT/LoRA/Adapters | Parameter-efficient downstream tuning | Not directly, but similar spirit | Efficient, but impact limited compared to full hyperparam tuning |