My last blog post examined the bottlenecks in CPU-based inference and explored the performance difference between fixed and varied prompts.
In this post, let’s expand our benchmark to quantify whether server optimizations can lift performance in our CPU-only environment. We will review optimization strategies such as caching and batching (sequential and concurrent), with code and results showing their observed effectiveness. We will also compare each test’s results with our initial baseline (fixed short prompt, sequential runs): 94.52ms TTFT and 10.57 TPS.
What are inference optimizations?
LLM serving frameworks use caching and concurrency to optimize the TTFT and TPS metrics. Benchmarking these gives a realistic look at client inference, because it measures the actual quality of service (QoS) a real client receives under various conditions. It also reveals the server’s behavior under load, and how optimization techniques mitigate CPU bottlenecks.
This post will examine inference from two aspects: batching/concurrency and caching.

In real-world client scenarios, the “batch size” is simply the number of concurrent users or requests hitting the server at the same moment. Our batching test probes these main performance components:
Throughput saturation ceiling
Establishes the maximum sustainable throughput (TPS): the hard limit for our specific CPU and RAM speed, regardless of configuration.
Queuing tax in concurrency
The concurrent test models a server under high load (e.g., a short flash crowd on a small web service) and reveals the queuing latency and the cost of CPU contention by checking whether TTFT skyrockets as batch size increases. For example, if eight clients submit requests simultaneously in a real-world application, seven of them might experience a delay of many seconds before generation even begins.
The caching test simulates a client making requests that rely on contextual memory (long system prompts, conversation history, or document retrieval). It models the most common LLM client scenario: “The user has a long chat history, and the system needs to quickly generate the next token without re-processing the entire 4,000-token context.” The caching test can confirm whether reusing long context yields a meaningful TTFT speedup.
Server setup
We will start llama-server with these optimized flags:
./llama-server -m /path/to/qwen-2.5-3b-q4_k_m.gguf \
  -t 4 \
  --threads-batch 4 \
  -np 8 \
  --host 0.0.0.0 --port 8080

Key parameters:
- -m: model path
- -t 4: the number of threads for token generation (decode). 4 = our Ampere A1 free-tier compute instance’s core count.
- --threads-batch 4: the number of threads for batch and prompt processing (pre-fill).
- -np 8: the number of context slots for continuous batching. We start high (8) to test the queuing limit.
KV Caching
In a Transformer model, the Keys and Values for the self-attention layers are computed for every input token. When generating the next output token, the model needs the KV of all previous tokens (both the prompt and the already generated output).
Most LLM applications are conversational (chatbots) or agentic (system prompts), involving long context reuse.
Hidden system prompt (shared prefix)
Most modern chatbots use a long, hidden system prompt (e.g., “You are an expert financial assistant. Stay polite, brief, and only answer questions about the stock market. Your name is FinBot.”) This prompt is prepended to every single user turn.
- Cache hit: When a user sends a new message (a different prompt), the LLM still has to process the entire shared system prompt first. The KV Cache stores the computation for this long, identical system prompt.
- Latency reduction: By reusing the cached computation for the system prompt, the LLM bypasses the cold run penalty for the shared prefix, and only has to compute the new few-token user prompt.
Conversation history (incremental reuse)
In a continuous chat session, the current prompt is not just the new user message, but the entire transcript of the conversation up to that point.
- Cache hit: When the user sends “What about Linux?” after “Explain how to install Windows,” the prompt prefix “Explain how to install Windows” is the same across the two turns. The KV Cache reuses the computation for the history, reducing TTFT for the new message.
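The idea behind prefix reuse can be sketched in a few lines: the server can only reuse cached K/V entries for the leading tokens that match exactly between the stored context and the incoming prompt; everything after the first mismatch must be re-prefilled. This is a toy illustration with made-up token IDs — llama.cpp’s actual slot-matching logic is more involved.

```python
def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading token IDs shared by the cached context and the
    new prompt. Only these positions can have their K/V reused; the
    remaining tokens must go through prefill again."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Toy token IDs: turn 1 is fully contained in turn 2's prompt.
turn1 = [12, 7, 99, 4]          # "Explain how to install Windows"
turn2 = [12, 7, 99, 4, 31, 8]   # same history + "What about Linux?"
print(common_prefix_len(turn1, turn2))  # 4 -> only 2 new tokens need prefill
```

The longer the shared history, the larger this prefix and the more prefill work the server skips.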
Caching reduces redundant computation by reusing attention keys/values instead of recomputing them: during decode this keeps per-token cost low (improving TPS), and across requests a reused prefix skips most of the prefill (improving TTFT). It transforms a compute-intensive operation into a memory-bound lookup. The size and management of the KV cache (via techniques like PagedAttention) are primary drivers of throughput limits. However, reuse is only beneficial when the time saved by skipping a long pre-fill outweighs the overhead of cache lookup and context switching.
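That breakeven condition is just arithmetic, and can be sketched numerically. The prefill throughput and lookup overhead below are illustrative assumptions, not measurements from this benchmark:

```python
def cache_breakeven_ms(prefix_tokens: int, prefill_tps: float, overhead_ms: float) -> float:
    """Net time saved (ms) by reusing a cached prefix: prefill time skipped,
    minus the lookup/context-switch overhead. Positive means caching pays off;
    negative means the warm run would actually be slower."""
    saved_ms = prefix_tokens / prefill_tps * 1000.0
    return saved_ms - overhead_ms

# Hypothetical figures: a 2,500-token prefix at ~100 prefill tokens/s,
# with an assumed 10 ms cache lookup cost.
print(cache_breakeven_ms(2500, 100.0, 10.0))  # large positive: caching wins
print(cache_breakeven_ms(5, 100.0, 60.0))     # negative: a 5-token prefix isn't worth it
```

This is why the benchmark below deliberately uses a very long prefix: it makes the saved prefill time dominate any overhead, so the caching gain is unambiguous.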
Code
To accurately measure caching benefits, we will:
Perform a “Cold Run”:
Since the KV cache is useful when a large prefix is identical across different requests (e.g., a common system prompt), we will use a long, unique prefix that guarantees a cache miss, forcing a full prefill calculation and measuring the true TTFT baseline.
The actual length of this prefix is highly dependent on the tokenizer, but repeating it 10 times should guarantee a very long context (>1024 tokens, aiming for ~4k or close to the context window limit).
Ensuring a truly cold run is difficult if the server is already “hot” (weights loaded, prior prompts in cache). Using a unique single-token ID as a cache buster (e.g., cache_buster = f"[UNIQUE-ID-{random.randint(10000, 99999)}]") for the cold run is crucial. This cache buster is prepended to the prefix to force a cache miss.

Perform “Warm Runs”: Subsequent requests reuse a significant portion of the same prefix as the cold run, without the cache buster. TTFT should be much lower because the server only processes the new tokens after the long prefix.
Sequential execution: Run the cold and warm requests sequentially using the same client object to ensure the KV cache is maintained by the server.
Compare TTFT variations.
We will reuse the library imports and helper functions from our single run tests in my previous post.
LONG_CACHE_PREFIX = """
In the domain of high-performance computing, particularly when dealing with large language models on CPU-only hardware, the performance bottleneck inevitably shifts from computational FLOPS to DRAM memory bandwidth. The Qwen 2.5 3B model, while small, is still limited by the speed at which its multi-gigabyte weight file can be streamed from system RAM to the CPU's compute units for every single token generation step. This is the core reason why parallelizing requests via batching rapidly hits a throughput ceiling (flat TPS) and why the Time To First Token (TTFT) skyrockets due to queuing. Caching, in this memory-bound scenario, must provide massive savings to be worthwhile. It only does this when the prefix that is reused is exceptionally long, guaranteeing that the high cost of the initial prompt pre-fill is avoided entirely. If the time saved from skipping the pre-fill is less than the overhead of a context switch and cache look-up, the warm run will appear slower. Therefore, we must use a long text prefix to properly isolate the gain. This text is intentionally repetitive and verbose to ensure a large token count, simulating a long-form document, a full chat history, or a detailed system instruction that is common across multiple user queries.
""" * 10 We then implement our caching benchmark:
def benchmark_caching(client: openai.OpenAI, model_name: str, base_prompt: str, num_warm: int = 3) -> Dict[str, Any]:
    """Measure caching effectiveness using a long common prefix."""
    print("\n--- Caching Test (LONG PROMPT) ---")

    # Generate a unique, single-token ID to break any pre-existing cache.
    # This ensures the Cold Run is a true full re-evaluation.
    cache_buster = f"[UNIQUE-ID-{random.randint(10000, 99999)}]"

    # 1. Cold run: forces a cache miss due to the unique starting token.
    #    The unique starting token prevents llama.cpp from reusing a cache slot
    #    that may have been populated by earlier benchmark runs.
    cold_prompt = cache_buster + " " + base_prompt + " Now, state the single word conclusion for the cold run:"

    # Count the prompt tokens *with* the cache buster for accurate logging
    cold_prompt_tokens = count_tokens(cold_prompt)
    print(f"1. Cold Run (Prefix Tokens: {cold_prompt_tokens}, unique start token: {cache_buster})...", end="")
    cold_result = benchmark_single_request(client, model_name, cold_prompt, max_tokens=1)  # Generate only 1 token

    # Handle cold run failure
    if "error" in cold_result:
        print(f" FAILED: {cold_result['error']}")
        return {"cold_ttft": 0, "avg_warm_ttft": 0, "speedup": 0, "status": "Cold run failed"}

    cold_ttft = cold_result["ttft_ms"]
    print(f" TTFT: {cold_ttft:.2f}ms")

    # 2. Warm runs: use the original, common prefix *without* the cache buster.
    #    The first warm run (i=0) builds the cache for 'base_prompt';
    #    subsequent runs (i>0) should hit the cache.
    print(f"2. Warm Runs (Cache Reuse on common prefix)...")
    warm_ttfts = []
    # Run num_warm + 1 requests: i=0 is the cache build, i=1..num_warm are measurements.
    for i in range(num_warm + 1):
        # The warm prompts share the same long prefix; only the suffix changes.
        varied_prompt = base_prompt + f" Now, state the single word conclusion for iteration {i+1}:"
        warm_result = benchmark_single_request(client, model_name, varied_prompt, max_tokens=1)  # Generate only 1 token
        if "error" in warm_result:
            print(f" Warm Run {i+1} FAILED: {warm_result['error']}")
            continue
        # Only record results *after* the initial cache-build run (i=0)
        if i > 0:
            warm_ttfts.append(warm_result["ttft_ms"])
            print(f" Warm Run {i} TTFT: {warm_result['ttft_ms']:.2f}ms")

    if not warm_ttfts:
        return {"cold_ttft": cold_ttft, "avg_warm_ttft": cold_ttft, "speedup": 0, "status": "Warm runs failed"}

    avg_warm_ttft = statistics.mean(warm_ttfts)
    cache_speedup = 1 - (avg_warm_ttft / cold_ttft)
    print(f"\n Summary: Cold TTFT: {cold_ttft:.2f}ms | Avg Warm TTFT: {avg_warm_ttft:.2f}ms")
    print(f" Cache Speedup: {cache_speedup*100:.1f}%")
    return {"cold_ttft": cold_ttft, "avg_warm_ttft": avg_warm_ttft, "speedup": cache_speedup}

Finally, we run the benchmark in main:
# --- Caching Benchmark ---
caching_stats = benchmark_caching(client, config["model"], LONG_CACHE_PREFIX)
print("\n" + "="*50)Result
--- Caching Test (LONG PROMPT) ---
1. Cold Run (Prefix Tokens: 2591, unique start token: [UNIQUE-ID-59078])... TTFT: 186.07ms
2. Warm Runs (Cache Reuse)...
Warm Run 1 TTFT: 16.01ms
Warm Run 2 TTFT: 15.11ms
Warm Run 3 TTFT: 15.35ms
Summary: Cold TTFT: 186.07ms | Avg Warm TTFT: 15.49ms
 Cache Speedup: 91.7%

The caching test reveals a massive performance win for long contexts, which was the goal of the long prompt.
The 91.7% cache speedup is clear evidence that the server is correctly reusing long contexts. It confirms that the high cost of processing a long system prompt (186.07ms cold TTFT) is nearly eliminated (15.49ms average warm TTFT) on subsequent messages.
This shows that even on a CPU-only host, KV cache reuse is the most important optimization for reducing latency in conversational or instruction-heavy scenarios, improving user experience. In a real-world chatbot where the system prompt or conversation history is long, the client would see near-instant responses for subsequent messages, entirely bypassing the ~186ms penalty of a cold run.
Concurrency & Batching
Batching and concurrency aim to optimize overall system throughput (TPS) by processing multiple requests in parallel. However, as concurrency increases, individual requests may experience higher TTFT due to:
- Increased queuing latency: requests spend more time waiting for resources.
- Context-switching overhead: the server’s cost of switching between multiple active requests.
As we’re running on a CPU-only system (which is even more strained than a GPU setup), we will test how the server handles both sequential and concurrent requests under load.
Sequential vs. Concurrent Batch
We will run two batch tests. The sequential test finds the hardware’s best-case speed; the concurrent test finds the real-world latency and stability under stress. The difference is about execution order and resource contention.
| Feature | Sequential Batch (No Queuing/Contention) | Concurrent Batch (Measures Queuing Tax) |
|---|---|---|
| Request Order | Strictly one after the other. Request N must completely finish before request N+1 is started. | Simultaneous launch. Requests 1, 2, 3, … are all started at the same time by the client. |
| Server View | Only sees a single request running at a time. The resource contention is minimal. | Sees all requests arriving at once and must place them in a queue to be processed by the limited (4-core) hardware. |
| Measured Metric | Measures the maximum raw throughput (TPS) of the hardware, unburdened by queue delays. The TTFT is the true, raw processing time. | Measures the system’s overall performance under high load. The TTFT primarily measures the queuing latency (the wait time). |
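Before running the tests, the expected shape of the “queuing tax” can be sketched with an idealized serial-queue model: if N requests arrive at once and are served strictly one at a time, request i waits i full service times before it even starts. This ignores continuous batching and prefill interleaving, so it will not match the measurements exactly; it only predicts the trend.

```python
def avg_queue_wait(batch_size: int, service_s: float) -> float:
    """Mean wait (in service-time units) before processing starts when
    batch_size requests arrive simultaneously and are served one at a time:
    request i waits i * service_s, so the mean is (N-1)/2 * service_s."""
    return sum(i * service_s for i in range(batch_size)) / batch_size

print(avg_queue_wait(1, 1.0))   # 0.0 -> no queue at batch size 1
print(avg_queue_wait(8, 1.0))   # 3.5 -> average wait grows linearly with N
print(avg_queue_wait(12, 1.0))  # 5.5
```

The model predicts average TTFT growing roughly linearly with batch size while per-request service time stays fixed, which is exactly the pattern to look for in the concurrent results below.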
Code
Our approach for the tests will be as follows:
TTFT Isolation: To find the true prompt processing time without queuing, use B=1 (single, non-concurrent request) with a long prompt (e.g., 8192 tokens) and generate only 1 token. This is our baseline TTFT/PP time.
Unique request IDs: For both sequential and concurrent tests, we ensure each prompt has a unique identifier (e.g., prompt + f" ID: {random.randint(...)}") to prevent the KV cache from serving cached responses, forcing the server to treat each request as a distinct unit for batching. This ensures we’re measuring true batching efficiency, not accidental caching.

Fixed workload: Use short, fixed prompts with fixed output lengths to ensure a uniform workload.
Vary the batch size (1, 2, 4, 8, 12) to find the sweet spot, testing up to 3x the A1 Flex core count (4) for stress. Measure the aggregate TPS (total tokens / total time) as the maximum sustained TPS.
Isolate server throughput (TPS) from client-side queuing: measure only the window from the start of the first request to the end of the last request; total tokens generated over this wall-clock window gives the aggregate throughput (Batch TPS).
Use the async library: Define a synchronous benchmark threaded_benchmark, then use asyncio.to_thread with asyncio.gather to run N copies in parallel. This runs the entire synchronous operation in its own thread per request, ensuring concurrency without interfering with stream iteration.
Again, we will reuse the library imports and helper functions from our single run tests in my previous post, and define batching constants.
SHORT_PROMPT_FOR_BATCHING = "The quick brown fox jumps over the lazy dog."  # Used for batching tests

We then define the benchmark code for both a sequential batch and a concurrent test.
# --- BATCHING BENCHMARK ---
def run_sequential_batch(client: openai.OpenAI, model_name: str, prompt: str, batch_size: int) -> Dict[str, Any]:
    """Runs sequential requests to establish a clean baseline (no queuing)."""
    results = []
    print(f" Running Sequential Batch (BS={batch_size})...", end="")
    for i in range(batch_size):
        # Use a unique suffix to ensure NO CACHING between sequential runs
        unique_prompt = prompt + f" ID: {random.randint(10000, 99999)}"
        result = benchmark_single_request(client, model_name, unique_prompt)
        if "error" in result:
            print(f"FAILED (Req {i+1}): {result['error']}")
            continue
        results.append(result)

    if not results:
        raise Exception("All sequential requests failed.")

    total_latency_s = sum(r["total_time_s"] for r in results)
    total_tokens = sum(r["generated_tokens"] for r in results)
    avg_tps = total_tokens / total_latency_s
    avg_ttft = statistics.mean([r["ttft_ms"] for r in results])
    print(" Done.")
    return {
        "type": "Sequential",
        "batch_size": batch_size,
        "avg_ttft_ms": avg_ttft,
        "avg_tps": avg_tps,
    }


async def run_concurrent_batch(client: openai.OpenAI, model_name: str, prompt: str, batch_size: int) -> Dict[str, Any]:
    """Runs concurrent requests to measure queuing tax and aggregate TPS."""
    # Nested function to build a unique prompt and wrap the synchronous call
    def threaded_benchmark():
        # Use a unique suffix to ensure NO CACHING between concurrent runs
        unique_prompt = prompt + f" ID: {random.randint(10000, 99999)}"
        return benchmark_single_request(client, model_name, unique_prompt)

    tasks = [asyncio.to_thread(threaded_benchmark) for _ in range(batch_size)]
    start_time = time.time()
    batch_results = await asyncio.gather(*tasks, return_exceptions=True)
    end_time = time.time()

    valid_results = [r for r in batch_results if not isinstance(r, Exception) and "error" not in r]
    if not valid_results:
        raise Exception("All concurrent batch requests failed.")

    total_time = end_time - start_time
    total_tokens = sum(r["generated_tokens"] for r in valid_results)
    # Aggregate throughput over the wall-clock window
    batch_tps = total_tokens / total_time
    avg_ttft = statistics.mean([r["ttft_ms"] for r in valid_results])
    return {
        "type": "Concurrent",
        "batch_size": batch_size,
        "avg_ttft_ms": avg_ttft,
        "batch_tps": batch_tps,
        "success_rate": len(valid_results) / batch_size * 100,
    }

Finally, we explicitly run the sequential batch first to get a clean baseline without queuing, then the concurrent batch to measure the queuing tax.
# --- Batching Benchmarks ---
print("\n--- Batching Benchmarks (Short Prompt, Max Tokens=50) ---")
batch_sizes = [1, 2, 4, 8, 12]  # Test up to 3x A1 Flex core count for stress

# 1. Sequential test (baseline TTFT and TPS)
print("\n**1. Sequential Baseline (No Queuing/Contention):**")
for bs in batch_sizes:
    if bs > 4:  # Stop the sequential test at the core count for efficiency
        continue
    try:
        stats = run_sequential_batch(client, config["model"], SHORT_PROMPT_FOR_BATCHING, bs)
        print(f" Seq Batch {bs}: Avg TTFT {stats['avg_ttft_ms']:.2f}ms | Avg Gen+Prefill TPS {stats['avg_tps']:.2f}")
    except Exception as e:
        print(f" Seq Batch {bs} FAILED: {e}")

# 2. Concurrent test (measures queuing tax on TTFT and aggregate TPS)
print("\n**2. Concurrent Test (Measures Queuing Tax and Aggregate TPS):**")
for bs in batch_sizes:
    try:
        stats = asyncio.run(run_concurrent_batch(client, config["model"], SHORT_PROMPT_FOR_BATCHING, bs))
        print(f" Conc Batch {bs}: Avg TTFT {stats['avg_ttft_ms']:.2f}ms | **Aggregate TPS {stats['batch_tps']:.2f}** | Success {stats['success_rate']:.1f}%")
        if bs > 4 and stats['avg_ttft_ms'] > 1000:
            print(f" -> Interpretation: High TTFT confirms **queuing latency** on the CPU is a major factor at this concurrency level.")
    except Exception as e:
        print(f" Conc Batch {bs} FAILED: {e}")

Result
--- Batching Benchmarks (Short Prompt, Max Tokens=50) ---
**1. Sequential Baseline (No Queuing/Contention):**
Running Sequential Batch (BS=1)... Done.
Seq Batch 1: Avg TTFT 1606.63ms | Avg Gen+Prefill TPS 9.12
Running Sequential Batch (BS=2)... Done.
Seq Batch 2: Avg TTFT 649.59ms | Avg Gen+Prefill TPS 9.24
Running Sequential Batch (BS=4)... Done.
Seq Batch 4: Avg TTFT 715.45ms | Avg Gen+Prefill TPS 9.97
**2. Concurrent Test (Measures Queuing Tax and Aggregate TPS):**
Conc Batch 1: Avg TTFT 698.89ms | **Aggregate TPS 10.97** | Success 100.0%
Conc Batch 2: Avg TTFT 1408.57ms | **Aggregate TPS 10.17** | Success 100.0%
Conc Batch 4: Avg TTFT 2707.35ms | **Aggregate TPS 10.12** | Success 100.0%
Conc Batch 8: Avg TTFT 6594.63ms | **Aggregate TPS 10.91** | Success 100.0%
-> Interpretation: High TTFT confirms **queuing latency** on the CPU is a major factor at this concurrency level.
Conc Batch 12: Avg TTFT 9430.09ms | **Aggregate TPS 11.06** | Success 100.0%
 -> Interpretation: High TTFT confirms **queuing latency** on the CPU is a major factor at this concurrency level.

The batching results clearly illustrate the classic CPU-bound, queuing-based performance curve typical of llama.cpp on non-GPU hardware. In our setup, concurrency did not improve throughput while it overloaded the CPU’s ability to handle multiple pre-fill tasks.
The concurrent test shows a dramatic increase in average TTFT, from 698.89ms (Batch 1) to 9430.09ms (Batch 12), while the aggregate TPS remains effectively flat (~10.12 to 11.06).
TTFT skyrockets
TTFT includes queuing time plus prompt pre-fill time (processing the entire input prompt to build the Key-Value (KV) cache). Real-world usage is defined by burstiness: multiple users will inevitably hit the server simultaneously. In our concurrent batching, when llama-server receives the requests simultaneously, instead of processing them instantly it places them in a queue or assigns them to shared processing slots. As the concurrent batch size increases:
- BS=1 to BS=4: TTFT jumps from 698.89ms to 2707.35ms (roughly 3.9x).
- BS=1 to BS=8: TTFT hits 6.6 seconds (roughly 9.4x).
This is because requests arrive faster than the CPU can process them; each waits in the server’s queue for its turn on the fully saturated CPU cores. This queuing/scheduling delay is added directly to the TTFT, causing it to skyrocket.
TPS plateaus
The TPS flatlining indicates resource saturation. TPS in LLM inference, especially during the steady-state generation phase (after the first token), is typically memory-bound: bottlenecked by how fast the CPU (or GPU) can stream model weights and the KV cache from memory. Our system reaches its maximum sustainable throughput (limited by system RAM bandwidth and the 4-core CPU) even at a small batch size. Once that saturation point is reached, adding more concurrent requests doesn’t increase total throughput; it just forces each request to wait longer and share resources, keeping total TPS flat while drastically increasing individual latency (TTFT and overall latency).
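The memory-bandwidth ceiling can be estimated with back-of-the-envelope arithmetic: if every decoded token must stream the full weight file from RAM once, throughput cannot exceed bandwidth divided by model size. Both figures below are rough assumptions (an approximate Q4_K_M 3B weight size and a guessed usable DRAM bandwidth), not measurements:

```python
def bandwidth_bound_tps(model_bytes: float, mem_bw_gbs: float) -> float:
    """Upper bound on decode tokens/s under a pure memory-streaming model:
    TPS <= memory bandwidth / bytes read per token (~ the full weight file)."""
    return mem_bw_gbs * 1e9 / model_bytes

# Hypothetical figures: ~2 GB of quantized weights, ~25 GB/s usable bandwidth.
print(bandwidth_bound_tps(2e9, 25.0))  # 12.5 tokens/s ceiling
```

A ceiling in the low teens under these assumptions is consistent with the flat ~10 to 11 aggregate TPS we observed: the hardware is already saturated, so extra concurrency cannot buy more throughput.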
Conclusion
In conclusion, the benchmark tests show how a small optimization such as KV cache reuse yields a real UX improvement. They also show how a serialized server like llama.cpp processes requests largely one by one on a saturated CPU (no true parallelism), so larger concurrent batches form a queue. The first request gets a normal TTFT (~698ms), but subsequent ones wait for earlier requests to finish, inflating the average TTFT (at batch 12, roughly 94s of cumulative wait spread over 12 requests gives a ~7.8s average under a simple serial model, in the same ballpark as the measured 9.43s). We extract no material throughput gain from batching in a highly constrained 4-core CPU-only environment.