In previous posts, I’ve demonstrated multimodal search against the Unsplash Lite dataset using both Chroma and pgvector with the CLIP model. While the local search performs great with minimal latency (~1.6 ms average), the embedding phase is very slow (~130 ms for a simple text string).
In this post, I will look at reducing this latency, surveying different ways to speed up the query embedding phase. The method I will explore in depth is moving to Oracle Autonomous Database 26ai, which offers a unique performance advantage: it runs the embedding model directly inside the database kernel.
Why optimize?
For a dataset of 25,000 images (with 512-dimensional embeddings for ViT-B/32), our primary bottleneck is the query embedding phase, as even brute-force similarity search takes less than 5 ms for that volume. Optimizing CLIP inference for query embedding involves shifting from general-purpose deep learning libraries like PyTorch/SentenceTransformers to specialized inference runtimes and managed services. The following are standard optimization practices:
- Switch to ONNX Runtime (CPU/GPU): SentenceTransformer uses PyTorch by default, which carries significant overhead. Converting the model to ONNX and using ONNX Runtime (with OpenVINO for Intel CPUs or TensorRT for NVIDIA GPUs) can reduce latency by 3x to 5x. This is the approach I will take in this post.
- Quantization (FP16 or INT8): Converting the weights from FP32 to FP16 (minimal accuracy loss) or INT8 (requires calibration) significantly reduces the computational load and memory bandwidth requirements.
- Model Pruning & Distillation: Use smaller “distilled” versions like MobileCLIP or FastCLIP if the slightly lower accuracy is acceptable for your use case.
- Preprocessing Optimization: For image queries, ensure image resizing and normalization (typically done with Pillow or OpenCV) are vectorized and pre-allocated to avoid Python’s Global Interpreter Lock (GIL) bottlenecks.
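To ground the claim that search itself is not the bottleneck at this scale, here is a quick sketch that times an exact, brute-force top-5 scan over 25k × 512-dim normalized vectors with NumPy. The vectors are random stand-ins for the real Unsplash embeddings:

```python
import time
import numpy as np

# Random unit vectors standing in for the 25k Unsplash CLIP embeddings
rng = np.random.default_rng(0)
db = rng.standard_normal((25_000, 512)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.standard_normal(512).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = db @ query             # cosine similarity (all vectors are unit-norm)
top5 = np.argsort(-scores)[:5]  # exact top-5, no index needed
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"Brute-force top-5 over 25k x 512 vectors: {elapsed_ms:.2f} ms")
```

On a typical laptop CPU this lands in the low single-digit milliseconds, which is why the ~130 ms embedding step dominates end-to-end latency.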
Why Oracle Autonomous AI Database (OADB)?
I’ve been a big fan of Oracle Cloud’s free-tier offerings as I found them to be generous and robust enough for independent developers. In the database realm, each Always-Free tier account can provision two Autonomous Database instances with 1 OCPU and 20 GB of storage each.
In my user scenario, I need a database similar to pgvector: one that provides a first-class VECTOR type, distance functions, and approximate/exact similarity search. Autonomous Database nails this. What caught my eye, in addition to the built-in vector store, is its In-Database ONNX Runtime, which eliminates the need to provision, configure, and maintain a separate environment to generate embeddings. It also removes the need to move data out of the database to a separate service, improving both latency and security. By moving my CLIP model inside the database, I should be able to eliminate the overhead of Python-to-DB data transfers and leverage Oracle’s native C-based ONNX runtime for on-the-fly embedding of queries directly in SQL via the VECTOR_EMBEDDING function.
We are limited to models that can be converted to ONNX. While CLIP is supported, extremely large or custom architectures may require more effort to import than simply calling an API.
Here are the advantages of OADB:
- Latency: Because the CLIP model runs inside the DB kernel via an integrated ONNX runtime, there is no network round-trip. End-to-end SQL execution for an embedding typically takes on the order of 10 ms.
- Elimination of Inter-Process Communication (IPC): In our current setup, the query embedding (created in Python, whether via ONNX or SentenceTransformer) must be generated in our app’s memory space, serialized and sent over a network/socket to Chroma or pgvector, and then deserialized. In OADB, the embedding is generated and passed to the search engine in-memory.
- Hardware-Level Optimizations: Oracle’s in-database ONNX runtime is highly optimized for SIMD (AVX-512) instructions on Intel/AMD CPUs. It bypasses the Python interpreter’s overhead entirely.
- Unified Execution Plan: We can run a single SQL query like:
`SELECT * FROM images ORDER BY VECTOR_DISTANCE(image_vector, VECTOR_EMBEDDING(clip_model USING :query_text)) FETCH FIRST 5 ROWS ONLY;` The database optimizer treats the embedding and the search as a single operation, maximizing CPU cache locality.
- Throughput: Throughput is bound by the OCPUs or GPUs allocated to an Autonomous instance. It is highly efficient for bulk vectorization because it avoids the overhead of thousands of HTTP/JSON calls.
- Converged Database: The Transaction Processing workload has multi-model support: it handles JSON, graph, text, spatial, vector, and relational data within a single platform. It includes a native JSON data type (OSON) that is faster than standard text-based JSON, allowing us to index and search documents with SQL or simple APIs. In our case, we can store image embeddings (vectors), documents (JSON/text), and relational data (photo metadata) in one place, serving as the low-latency backend for a search application with a user-facing frontend.
For example, we can create a single table that looks like this to handle our entire requirement:
CREATE TABLE image_library (
  id NUMBER PRIMARY KEY,
  image_name VARCHAR2(255),
  -- The Vector:
  embedding VECTOR(512, FLOAT32),
  -- The Document:
  metadata JSON,
  -- Other Data:
  upload_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  tags VARCHAR2(4000)
);

This enables us to run a hybrid search in one query: “Find images similar to this text AND where the document metadata says the category is ‘Nature’.” In this way, a single query can retrieve documents, images, audio, video, and table rows together without ETL hops.
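As an illustration of that hybrid pattern, the sketch below only composes the SQL text; the `JSON_VALUE` path and the `clip_model` name are assumptions matching the example schema above, and executing the statement would require the table and model set up in the later phases:

```python
def build_hybrid_search_sql(k: int = 5) -> str:
    """Compose a hybrid (JSON filter + vector similarity) query string.

    The table/column names follow the image_library example; :category and
    :query_text are bind variables to be supplied at execution time.
    """
    return f"""
SELECT image_name
FROM image_library
WHERE JSON_VALUE(metadata, '$.category') = :category
ORDER BY VECTOR_DISTANCE(
    embedding,
    VECTOR_EMBEDDING(clip_model USING :query_text AS DATA),
    COSINE)
FETCH FIRST {k} ROWS ONLY"""

sql = build_hybrid_search_sql(5)
print(sql)
```

The metadata filter and the vector ordering live in the same statement, so the optimizer can plan them together instead of round-tripping between a document store and a vector store.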
Choosing from different OADB workloads
For a vector search use case involving 25,000 images and real-time queries, here is the breakdown:
| Category | Best For | Technical Characteristics | Why Not Ideal for our Vector Search App? | Recommendation |
|---|---|---|---|---|
| Autonomous Transaction Processing (ATP) | Real‑time applications, high concurrency, frequent small updates, OLTP workloads. | Row‑oriented storage, low‑latency index lookups, optimized for single-query response such as point queries, low‑latency vector search and transactional workloads. | None — this is the correct engine for a “search bar” experience where each query must return in tens of milliseconds. | Use ATP. Best match for low‑latency vector search + metadata queries. ATP is faster for single‑query, low‑latency workloads because row‑store avoids columnar decompression and is optimized for OLTP access paths. |
| Autonomous Data Warehouse (ADW) | Analytics, BI dashboards, massive scans, aggregations, complex reports. | Columnar storage, vectorized execution, optimized for high-throughput batch and analytic scans, not for interactive search. | Columnar layout adds overhead for single‑row access patterns; latency is higher for single‑query vector search. | Use only if the workload is analytics‑heavy (millions of rows scanned per query). |
| Lakehouse | Big data analytics, ETL, large‑scale warehousing. | Designed for high‑throughput ingestion and large analytical scans; not optimized for millisecond‑level point queries. | Latency too high for interactive search; optimized for batch analytics, not real‑time vector retrieval. | Not recommended for my use case. |
| JSON Database | Pure document‑store apps with JSON‑only data models. | Essentially ATP with a restricted feature set; optimized for JSON operations. | I have relational + vector + metadata needs. JSON DB removes features we need (indexes, SQL flexibility). | Stick with full ATP for maximum flexibility. |
| APEX Service | Low‑code developers building UI‑driven apps quickly. | ATP instance with a UI layer; same underlying engine. | Not useful if we’re building our own frontend. | Use standard ATP instead of APEX. |
Perform embedding search in Oracle Database 26ai
Let’s set up our environment! We will first provision a free-tier Autonomous Database instance, create a working connection to it, then upload our existing Unsplash image embeddings to a table with native vector support. We will then convert the CLIP model to ONNX format and upload it to the database. Finally, we will build a search index for better performance, and query the images with natural language.
Phase 1: Setup Oracle 26ai Autonomous Database (Free Tier)
Sign Up/Login: Go to the Oracle Cloud Console.
Create Database: Navigate to Oracle Database > Autonomous Database.
Configuration:
- Display Name: VectorDB
- Database Version: Ensure you select 26ai.
- Workload Type: Transaction Processing
- Deployment Type: Shared Infrastructure.
- Always Free: Toggle the Always Free switch to ON.
- Credentials: Set a strong ADMIN password.
Phase 2: Setup database connection
Oracle Autonomous Database supports both mTLS and TLS connections. Here’s a breakdown of the two authentication modes:
| Feature | mTLS (Mutual TLS) | TLS (One-way TLS) |
|---|---|---|
| Authentication | User/Pass + Client Certificate (Wallet) | Username / Password only |
| Setup | Download & manage .zip wallet files | Copy & paste a connection string |
| Port | 1522 | 1521 |
| Best For | High-security corporate apps | Quick prototyping |
By default, it is set to “mTLS only” for security. TLS authentication is easier to use, provides better connection latency, and does not require you to download client credentials (wallet) when using many modern drivers (like the python-oracledb driver in “Thin” mode). If your database is currently set to “Secure access from everywhere” without any rules, Oracle prevents you from disabling mTLS to protect you from brute-force attacks. To switch to TLS, the database must first be secured by either an Access Control List (ACL) or a Private Endpoint.
Since I will be building a frontend for multimodal search in a serverless environment, I will enable TLS as it works much better with a simple connection string than trying to manage an encrypted zip file in a serverless environment.
Enable TLS
Enable the Access Control List (ACL):
- On the Autonomous Database Details page, go to Network section.
- Click Edit next to Access Control List.
- In the slide-out, click + Add My IP Address. This ensures only the current computer can talk to the database.
- Click Update. The database will enter an “Updating” state for a moment.
Update mTLS Setting:
- Wait for the DB to show the green Available status.
- Click the Edit button next to Mutual TLS (mTLS) Authentication
- Uncheck the box “Require mutual TLS (mTLS) authentication.”
- Click Update.
Obtain Connection string
- Click “Database Connection”: Look for this button near the top of the page, right next to “Stop” and “Restart”
- Look for a dropdown or toggle labeled Mutual TLS (mTLS) and change it to TLS.
- In the table of connection strings below (usually named high, medium, and low), find the high service (this gives the most resources to vector search).
- Click Copy on that connection string. It will look something like this:

(description= (address=(protocol=tcps)(port=1522)(host=adb.us-ashburn-1.oraclecloud.com))(connect_data=(service_name=db_name_high.adb.oraclecloud.com))(security=(ssl_server_dn_match=yes)))
Phase 3: Upload 25k Embeddings to Database
Since we already have the vectors in pickle format, we will use the python-oracledb driver to push them from our local environment to the DB.
Create a table
The SQL Worksheet is our primary interface for running SQL and PL/SQL commands against our database instance.
Navigate to Oracle Database > Autonomous Database.
Click on the database name (VectorDB).
On the Database Details page, look for the Database Actions button and select SQL.
This opens a web-based IDE (SQL Worksheet) where you can paste and run all scripts.
Create the Table: Run this in the SQL Worksheet
CREATE TABLE image_search (
  id NUMBER PRIMARY KEY,
  image_name VARCHAR2(255),
  embedding VECTOR(512, FLOAT32) -- CLIP-ViT-B-32 produces 512-dim vectors
);
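Since the column is declared VECTOR(512, FLOAT32) and we will search it with cosine distance, a quick pre-flight check of dimensionality and normalization can save a failed bulk insert. This sketch uses random stand-in data in place of the real pickle contents:

```python
import numpy as np

# Random data standing in for the embeddings loaded from the pickle file
rng = np.random.default_rng(1)
embeddings = rng.standard_normal((100, 512)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# The table expects exactly 512 dimensions per vector
assert embeddings.shape[1] == 512, "VECTOR(512, FLOAT32) expects 512 dims"

# Cosine distance behaves best on (near) unit-norm vectors, as CLIP produces
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=1e-3), "cosine distance assumes unit norm"

print("All vectors are 512-dim and L2-normalized")
```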
Upload embeddings to new table
- Install the driver with pip install oracledb
- Run this script:
import oracledb
import pickle
# Database instance credentials
user = "ADMIN"
password = "Admin_Password"
# The Connection string from Phase 2
dsn_str = "(description= (address=(protocol=tcps)(port=1522)(host=...)))"
# Connect in THIN mode (no wallet needed)
connection = oracledb.connect(
user=user,
password=password,
dsn=dsn_str
)
print("Successfully connected using TLS!")
with open('embeddings.pkl', 'rb') as f:
    raw_data = pickle.load(f)  # A tuple of (filenames, embeddings)

# 1. Extract the two parts from the raw_data tuple
filenames = raw_data[0]
embeddings = raw_data[1]

# 2. Reformat into a list of tuples for the database
# We'll use the loop index 'i' as a unique ID
formatted_data = []
for i in range(len(filenames)):
    img_name = filenames[i]
    # Ensure the vector is a standard Python list of floats
    vector = embeddings[i].tolist()
    formatted_data.append((i, img_name, vector))

# 3. Batch insert into Autonomous Database
with connection.cursor() as cursor:
    # Define the SQL
    sql = "INSERT INTO image_search (id, image_name, embedding) VALUES (:1, :2, :3)"
    # Specify the column types:
    #   oracledb.DB_TYPE_NUMBER  for ID
    #   oracledb.DB_TYPE_VARCHAR for Name
    #   oracledb.DB_TYPE_VECTOR  for the Embedding
    cursor.setinputsizes(oracledb.DB_TYPE_NUMBER, oracledb.DB_TYPE_VARCHAR, oracledb.DB_TYPE_VECTOR)
    # Execute the insert in chunks (executemany has no batch_size argument,
    # so we slice the data ourselves)
    batch_size = 1000
    for start in range(0, len(formatted_data), batch_size):
        cursor.executemany(sql, formatted_data[start:start + batch_size])
    connection.commit()

print(f"Success! Uploaded {len(formatted_data)} records.")

When this is done, we will see Success! Uploaded 24996 records. in the console. We can verify the data in SQL Worksheet with:
SELECT count(*) FROM image_search;
-- And check the vector format:
SELECT id, image_name, embedding FROM image_search FETCH FIRST 1 ROWS ONLY;

Phase 4: Convert CLIP to ONNX and Load into DB
To use embedding models in the database, we need to convert the model into ONNX format. Oracle provides preconfigured ONNX pipeline models. These are augmented pipelines that include tokenization and post-processing steps for generating vector embeddings in the Oracle Database.
If your desired model is not prebuilt by Oracle, you can use the Oracle Machine Learning for Python (OML4Py) utility to convert your model.
Export to ONNX (Local Python)
Oracle’s oml library can wrap the CLIP model with the necessary pre-processing (like image decoding, resizing, normalization) so it works natively in SQL. Run this in an environment with OML4Py installed:
from oml.utils import ONNXPipeline, ONNXPipelineConfig
# Use the built-in multimodal CLIP template
config = ONNXPipelineConfig.from_template("multimodal_clip")
# Specify the exact CLIP-ViT-B-32 model from Hugging Face
pipeline = ONNXPipeline("openai/clip-vit-base-patch32", config=config)
# Export to files (generates two ONNX files with preprocessing + normalization)
pipeline.export2file("clip_vit_base_patch32")

Notice that this creates two files:
- clip_vit_base_patch32_txt.onnx (text encoder)
- clip_vit_base_patch32_img.onnx (image encoder)

Unlike the original multimodal model, which is a single file, each modality file contains its own pre-processing and post-processing pipeline.
Upload Files to OCI Object Storage
Next we are going to upload both .onnx files to a free-tier OCI Object Storage bucket accessible from our Autonomous Database. The Always-Free tier offers 20 GB of capacity across Standard, Infrequent Access, and Archive Storage tiers with an S3-compatible API for integration with existing tools and applications.
1. Create Bucket
In OCI Console, go to Storage > Buckets
- Bucket Name: Provide a clear, unique name (e.g., AI_Models).
For the remaining options:
| Option | Setting | Reason |
|---|---|---|
| Default Storage Tier | Standard | The database requires fast, immediate access to the model file during the loading process. The Archive tier will cause the load command to fail because the data must be manually restored before use. |
| Enable Auto-Tiering | Disabled (Optional) | While good for costs, it’s safer to keep models in Standard to avoid latency during re-loads. |
| Enable Object Versioning | Enabled | This allows us to keep track of different versions of the model file without overwriting the original. |
| Emit Object Events | Disabled | You likely don’t need automation triggers (like OCI Functions) for this simple use case. |
| Uncommitted Multipart Uploads Cleanup | Enabled | Good hygiene to clean up failed large-file uploads after 7 days. |
| Encryption | Encrypt using Oracle-managed keys | Using customer-managed keys adds significant complexity (Vault setup) that is usually unnecessary for this phase. |
| Bucket Visibility | Private (the default) | Never make the bucket public for model storage. Instead, use a Pre-Authenticated Request (PAR) or Database Credentials to give the database secure, temporary access. |
2. Upload Model
After creating a bucket, click Upload Object to upload the .onnx files. When you click Upload in the OCI Bucket, use these specific settings for each model file:
- Storage Tier: Keep this as Standard (inherited from bucket).
- Object Name: Keep the exported file names (e.g., clip_vit_base_patch32_txt.onnx). Avoid special characters or spaces.
3. Grant Database Access
Once we upload the .onnx files, the easiest way for the database to “see” them is to create a Pre-Authenticated Request (PAR). This generates a temporary, secure URL that the database can use without needing complex login credentials.
- In the Bucket, find the clip_vit_base_patch32_txt.onnx file and click the three dots (Actions menu).
- Select Create Pre-authenticated Request.
- Access Type: Select Object and Permit object reads.
- Expiration: Set it for a few weeks (enough time to finish the setup).
- Click Create and COPY THE URL IMMEDIATELY. You will not be able to see this URL again.
Load Model into the Database
Autonomous Database supports importing ONNX models for classification, regression, clustering, and embeddings. The models will be imported as first-class MINING MODEL objects in the database schema. Inference can then be done using the family of OML scoring operators, including PREDICTION, CLUSTER, and VECTOR_EMBEDDING.
In the following SQL statement, we use DBMS_CLOUD.GET_OBJECT to fetch each file into a BLOB, then we import the CLIP model via DBMS_DATA_MINING.IMPORT_ONNX_MODEL.
DECLARE
ONNX_MOD_FILE VARCHAR2(100) := 'clip_vit_base_patch32_txt.onnx';
MODNAME VARCHAR2(500);
LOCATION_URI VARCHAR2(200) := 'PAR URI';
BEGIN
DBMS_OUTPUT.PUT_LINE('ONNX model file name in Object Storage is: '||ONNX_MOD_FILE);
--------------------------------------------
-- Define a model name for the loaded model
--------------------------------------------
SELECT UPPER(REGEXP_SUBSTR(ONNX_MOD_FILE, '[^.]+')) INTO MODNAME from dual;
DBMS_OUTPUT.PUT_LINE('Model will be loaded and saved with name: '||MODNAME);
-----------------------------------------------------
-- Read the ONNX model file from Object Storage into
-- the Autonomous Database data pump directory
-----------------------------------------------------
BEGIN DBMS_VECTOR.DROP_ONNX_MODEL(model_name => MODNAME);
EXCEPTION WHEN OTHERS THEN NULL; END;
DBMS_CLOUD.GET_OBJECT(
credential_name => NULL,
directory_name => 'DATA_PUMP_DIR',
object_uri => LOCATION_URI||ONNX_MOD_FILE);
-----------------------------------------
-- Load the ONNX model to the database
-----------------------------------------
DBMS_VECTOR.LOAD_ONNX_MODEL(
directory => 'DATA_PUMP_DIR',
file_name => ONNX_MOD_FILE,
model_name => MODNAME);
DBMS_OUTPUT.PUT_LINE('New model successfully loaded with name: '||MODNAME);
END;
/

Verify the loaded model by running SELECT model_name, mining_function FROM user_mining_models; to confirm CLIP_VIT_BASE_PATCH32_TXT is now a “first-class citizen” in our DB.
Phase 5: Create an index
We will need to set up a vector index on our embedding column for significantly faster similarity searches, especially as our dataset grows. Without an index, queries perform exhaustive full-table scans, which slow down noticeably. With an index, Oracle uses approximate nearest neighbor (ANN) search for sub-second response times on large datasets while maintaining high accuracy.
Oracle Database 26ai supports advanced vector indexes like HNSW (Hierarchical Navigable Small World) for excellent accuracy-recall balance with cosine distance (ideal for normalized CLIP embeddings).
CREATE VECTOR INDEX img_hnsw_idx
ON image_search (embedding)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

- DISTANCE COSINE: Required for CLIP embeddings (which are L2-normalized; cosine is the standard metric).
- WITH TARGET ACCURACY 95: Optional but recommended—controls the recall/accuracy tradeoff during approximate searches (higher = better accuracy, slightly slower). You can adjust it (e.g., 90–99) or omit it for the default.
HNSW indexes are in-memory only and load into the database’s Vector Pool. For large datasets (millions of 512-dim vectors), this may take time/memory. The Vector Pool is often auto-managed, and can be monitored via V$VECTOR_MEMORY_POOL.
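TARGET ACCURACY is essentially a recall target: with 95, the approximate search aims to return roughly 95% of the true nearest neighbors. If you want to measure this yourself, recall@k is just the overlap between the exact and approximate result sets, sketched here with hypothetical ID lists:

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

# Toy illustration: exact top-5 vs a hypothetical ANN result missing one hit
exact = [3, 17, 42, 99, 7]
approx = [3, 17, 42, 99, 55]
print(f"recall@5 = {recall_at_k(exact, approx):.2f}")  # → 0.80
```

In practice you would gather `exact` from a query without the index hint (exact FETCH FIRST) and `approx` from the FETCH APPROX variant, then average recall over a sample of queries.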
Run this to check our index is properly created:
SELECT index_name,
index_type,
table_name,
status
FROM all_indexes
WHERE owner = 'ADMIN'
ORDER BY index_name;

We will see our img_hnsw_idx index showing up with:
INDEX_TYPE = 'VECTOR'
STATUS = VALID
We can also check vector index metadata in USER_IND_COLUMNS to confirm the column the vector index is built on:
SELECT index_name,
column_name,
column_position
FROM user_ind_columns
WHERE index_name = 'IMG_HNSW_IDX';

which will output:
INDEX_NAME COLUMN_NAME COLUMN_POSITION
------------ ----------- ---------------
IMG_HNSW_IDX EMBEDDING 1
Phase 6: Perform a Similarity Search
Now we can use our CLIP model directly in SQL to search our images using plain English.
- We call VECTOR_EMBEDDING() with our model, passing to it a search text as a bind variable.
- We then use the VECTOR_DISTANCE() function to compute the distance between our embedding column and the query embedding. Use COSINE distance since the pipeline normalizes embeddings.
- We order the output by the vector distance between our query and existing embeddings.
- The database uses our HNSW index automatically for FETCH APPROX.
SELECT image_name,
VECTOR_DISTANCE(embedding, VECTOR_EMBEDDING(clip_vit_base_patch32_txt USING :search_text), COSINE) as score
FROM image_search
ORDER BY score
FETCH APPROX FIRST 10 ROWS ONLY;

or
SELECT image_name
FROM image_search
ORDER BY VECTOR_DISTANCE(
embedding,
VECTOR_EMBEDDING(CLIP_VIT_BASE_PATCH32_TXT USING :search_text),
COSINE
)
FETCH FIRST 5 ROWS ONLY;

Benchmarking Latency
Let’s compare different ways to embed and search
- Using SentenceTransformer to embed, as shown in sample documentation and my previous post
- Convert the CLIP model to ONNX and embed using a local ONNX runtime
- Connect to our Oracle database from a Python client and perform embed and search in the database
- Directly embed and search in the database using SQL Developer
SentenceTransformer
from sentence_transformers import SentenceTransformer, util
import time
# Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')
start_time = time.perf_counter()
text_emb = model.encode('Two dogs in the snow')
end_time = time.perf_counter()
latency_ms = (end_time - start_time) * 1000
print(f"\nLocal Inference Time: {latency_ms:.2f} ms")

Output:
Local Inference Time: 130.43 ms
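A single encode() call like the one above can be skewed by one-off costs (first-call graph setup, cache warm-up). A small harness that discards warm-up runs and reports the median gives more stable numbers. The workload below is a stand-in; swap in model.encode('...') for the real measurement:

```python
import time
import statistics

def benchmark(fn, runs=20, warmup=3):
    """Return the median latency in ms over repeated runs, after warm-up."""
    for _ in range(warmup):
        fn()  # discard warm-up iterations (cold caches, lazy init)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)

# Stand-in workload; replace with lambda: model.encode('Two dogs in the snow')
median_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"Median latency: {median_ms:.2f} ms")
```

The median is preferred over the mean here because a single GC pause or OS scheduling hiccup can inflate an average badly.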
Local ONNX
In this test, we are using the same augmented ONNX model from Oracle to ensure a fair comparison.
import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path
from pathlib import Path
import time
# 1. Setup path
current_dir = Path.cwd()
model_file = current_dir / "clip_vit_base_patch32_txt.onnx"

if not model_file.exists():
    print(f"❌ File not found at: {model_file}")
else:
    # 2. Register the custom ops
    so = ort.SessionOptions()
    so.register_custom_ops_library(get_library_path())
    # 3. Load session with custom options
    session = ort.InferenceSession(str(model_file), sess_options=so)
    print("Model loaded successfully with Custom Extensions!")
    # 4. Prepare input
    # Note: Augmented models exported from Oracle often expect 'TEXT' or 'DATA'
    input_name = session.get_inputs()[0].name
    text_input = np.array(["a photo of a blue car"], dtype=object)
    # 5. Benchmark
    start_time = time.perf_counter()
    outputs = session.run(None, {input_name: text_input})
    end_time = time.perf_counter()
    latency_ms = (end_time - start_time) * 1000
    print(f"\nLocal Inference Time: {latency_ms:.2f} ms")
    print("Vector (First 5 dims):", outputs[0][0][:5])

Output:
Model loaded successfully with Custom Extensions!
Local Inference Time: 29.47 ms
Vector (First 5 dims): [ 0.01946913 -0.00345288 -0.01435798 0.02153437 -0.00712121]
Python with Oracle
import time
import statistics
import oracledb

# Reuse the ADMIN credentials and TLS DSN from Phase 3 in a session pool
pool = oracledb.create_pool(user=user, password=password, dsn=dsn_str, min=1, max=4)
def run_separated_benchmarks(text_query="a photo of a blue car", k=5, num_search_runs=10):
    print(f"Separating embedding (once) from search ({num_search_runs} times)")
    print(f"Query: '{text_query}' (k={k})")
    print("First embedding may be cold; searches will benefit from warmed model/index.")
    print("-" * 60)
    embedding_latency_ms = None
    search_latencies_ms = []
    with pool.acquire() as connection:
        with connection.cursor() as cursor:
            # --- PHASE 1: Single Embedding (timed separately) ---
            t0 = time.perf_counter()
            cursor.execute("""
                SELECT VECTOR_EMBEDDING(CLIP_VIT_BASE_PATCH32_TXT USING :txt AS DATA)
                FROM DUAL
            """, txt=text_query)
            query_vector = cursor.fetchone()[0]
            t1 = time.perf_counter()
            embedding_latency_ms = (t1 - t0) * 1000
            print(f"Embedding Latency: {embedding_latency_ms:.2f} ms")
            print(f"Query Vector (first 5 dims): {query_vector.toarray()[0][:5] if hasattr(query_vector, 'toarray') else query_vector[:5]}")
            print("-" * 60)
            # --- PHASE 2: Multiple Searches Using the Same Vector ---
            for i in range(num_search_runs):
                t2 = time.perf_counter()
                cursor.execute("""
                    SELECT /*+ NO_PARALLEL USE_VECTOR_INDEX(img_hnsw_idx) */
                           image_name,
                           VECTOR_DISTANCE(embedding, :bv, COSINE) AS distance
                    FROM image_search
                    ORDER BY distance
                    FETCH APPROX FIRST :k ROWS ONLY
                    WITH TARGET ACCURACY 90
                """, bv=query_vector, k=k)
                results = cursor.fetchall()
                t3 = time.perf_counter()
                search_latency_ms = (t3 - t2) * 1000
                search_latencies_ms.append(search_latency_ms)
                print(f"Search Run {i+1:2d}: {search_latency_ms:6.2f} ms | Top result: {results[0][0]} (dist: {results[0][1]:.4f})")
    # Search statistics
    avg_search = sum(search_latencies_ms) / len(search_latencies_ms)
    min_search = min(search_latencies_ms)
    max_search = max(search_latencies_ms)
    median_search = statistics.median(search_latencies_ms)
    print("-" * 60)
    print(f"Embedding (once) : {embedding_latency_ms:.2f} ms")
    print(f"Search Average : {avg_search:.2f} ms")
    print(f"Search Min : {min_search:.2f} ms")
    print(f"Search Max : {max_search:.2f} ms")
    print(f"Search Median : {median_search:.2f} ms")
    print(f"Steady-State Total (~embedding + search): {embedding_latency_ms + median_search:.2f} ms")

# Run it
run_separated_benchmarks()

Output:
Separating embedding (once) from search (10 times)
Query: 'a photo of a blue car' (k=5)
First embedding may be cold; searches will benefit from warmed model/index.
------------------------------------------------------------
Embedding Latency: 73.18 ms
Query Vector (first 5 dims): array('f', [0.019469130784273148, -0.003452877514064312, -0.014357976615428925, 0.021534373983740807, -0.007121213711798191])
------------------------------------------------------------
Search Run 1: 124.96 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 2: 119.27 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 3: 107.14 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 4: 108.00 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 5: 108.30 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 6: 107.83 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 7: 107.59 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 8: 107.66 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 9: 109.31 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
Search Run 10: 107.37 ms | Top result: bQATESa8mig.jpg (dist: 0.6908)
------------------------------------------------------------
Embedding (once) : 73.18 ms
Search Average : 110.74 ms
Search Min : 107.14 ms
Search Max : 124.96 ms
Search Median : 107.92 ms
Steady-State Total (~embedding + search): 181.10 ms
Notice that these steady-state latencies (~73 ms embedding + ~108 ms search, ~181 ms total) include client-side overhead from my local machine (west coast) to the Oracle Cloud Infrastructure (OCI) region hosting my Autonomous Database instance in Phoenix.
Each benchmark query involves:
- Network: ~30–50 ms per query (dominant in search latency); one round-trip to send the SQL and receive a small result set (top-5 rows, negligible data transfer); potentially extra time for connection setup/handshake if the pool isn’t warm.
- Pure database execution (which we will test next): embedding (CPU-bound ONNX inference) and search (HNSW traversal + distance calculation).
Measuring Pure Database Performance (Excluding Network)
Let’s check both the embedding and query phases to see if there’s any gain by moving to Oracle Database. Note that the first run for both embedding and query will incur cold-start latency, so a warm up is performed to normalize the time.
To isolate server-side execution time, I will clock both phases inside the database in Database Actions → SQL Worksheet (browser-based, executes server-side with negligible client overhead for timing). I will use a PL/SQL block timed with SYSTIMESTAMP (fractional-second precision; DBMS_UTILITY.GET_TIME only offers 10 ms resolution). I will fetch the top-5 results during search timing to ensure the full ANN execution runs:
SET SERVEROUTPUT ON;
DECLARE
l_start TIMESTAMP WITH TIME ZONE;
l_end TIMESTAMP WITH TIME ZONE;
l_embedding VECTOR;
diff INTERVAL DAY TO SECOND;
total_ms NUMBER;
-- Cursor fetch (to ensure full ANN execution is timed) for search results
CURSOR c_search IS
SELECT /*+ NO_PARALLEL USE_VECTOR_INDEX(img_hnsw_idx) */ image_name
FROM image_search
ORDER BY VECTOR_DISTANCE(embedding, l_embedding, COSINE)
FETCH APPROX FIRST 5 ROWS ONLY WITH TARGET ACCURACY 90;
l_dummy VARCHAR2(32767);
BEGIN
-- Optional warm-up
SELECT VECTOR_EMBEDDING(CLIP_VIT_BASE_PATCH32_TXT USING 'warmup text' AS DATA)
INTO l_embedding FROM DUAL;
-- Embedding Timing
l_start := SYSTIMESTAMP;
SELECT VECTOR_EMBEDDING(CLIP_VIT_BASE_PATCH32_TXT USING 'a photo of a blue car' AS DATA)
INTO l_embedding FROM DUAL;
l_end := SYSTIMESTAMP;
diff := l_end - l_start;
total_ms :=
EXTRACT(DAY FROM diff) * 86400000 +
EXTRACT(HOUR FROM diff) * 3600000 +
EXTRACT(MINUTE FROM diff) * 60000 +
EXTRACT(SECOND FROM diff) * 1000;
DBMS_OUTPUT.PUT_LINE('Pure Embedding Time: ' || TO_CHAR(total_ms, 'FM9990.000') || ' ms');
-- Search Timing
l_start := SYSTIMESTAMP;
OPEN c_search;
FOR i IN 1..5 LOOP
FETCH c_search INTO l_dummy;
EXIT WHEN c_search%NOTFOUND;
END LOOP;
CLOSE c_search;
l_end := SYSTIMESTAMP;
diff := l_end - l_start;
total_ms :=
EXTRACT(DAY FROM diff) * 86400000 +
EXTRACT(HOUR FROM diff) * 3600000 +
EXTRACT(MINUTE FROM diff) * 60000 +
EXTRACT(SECOND FROM diff) * 1000;
DBMS_OUTPUT.PUT_LINE('Pure Search Time: ' || TO_CHAR(total_ms, 'FM9990.000') || ' ms');
END;
/

Running this multiple times in SQL Worksheet, we get these results:
Pure Embedding Time: 12.730 ms
Pure Search Time: 2.118 ms
Pure Embedding Time: 12.524 ms
Pure Search Time: 2.203 ms
Pure Embedding Time: 12.960 ms
Pure Search Time: 2.199 ms
This is sub-20 ms end-to-end inside the database, which is blazing fast for a general-purpose RDBMS with custom ONNX models and vector indexing, all hosted in a free-tier shared infrastructure! Breaking it down:
- Pure Embedding Time (in-DB ONNX CLIP-ViT-B-32 text inference): ~12.5–13 ms consistently. ONNX CLIP inference is fast once the model session is cached.
- Pure Search Time (top-5 HNSW ANN over 25k × 512-dim vectors, fully in-memory): ~2.1–2.2 ms consistently.
- Total Pure Database Execution: ~14.6–15.2 ms per query (embedding + search).
On scaled OCPU or dedicated hardware, sub-10 ms search is standard for this scale. Our runs prove that the HNSW index + query are extremely efficient server-side due to:
- Full in-memory caching: After 1–2 runs, the entire HNSW graph resides in the Vector Memory Pool (SGA), and the ONNX model session is persistently loaded. The “cold start” penalty (model/index loading) is gone in these runs.
- Result/Plan Caching: The query is deterministic (same text → same embedding vector → same top-5 results). Oracle aggressively caches:
  - Scalar subquery results (the embedding vector).
  - Full query results (via the implicit result cache for small outputs).
  - Execution plans.
- Optimized Path: No network and no row-fetching overhead in our timing (cursor fetch is minimal); pure ANN traversal on warmed data is often <10 ms real time for 25k × 512-dim vectors on a decent CPU.
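The cursor `c_search` in the PL/SQL block is opened without its definition shown. A plausible shape for it is a top-5 approximate similarity query; the table and column names below (`unsplash_images`, `photo_id`, `embedding`) are placeholders, not from the post. Here it is sketched as the SQL text one would bind from the `oracledb` driver:

```python
# Hypothetical table/column names; the ONNX model name matches the post.
search_sql = """
    SELECT photo_id
    FROM   unsplash_images
    ORDER  BY VECTOR_DISTANCE(
                embedding,
                VECTOR_EMBEDDING(CLIP_VIT_BASE_PATCH32_TXT
                                 USING :query_text AS DATA),
                COSINE)
    FETCH APPROX FIRST 5 ROWS ONLY
"""

# With a live connection this would be:
#   cursor.execute(search_sql, query_text="a photo of a blue car")
print(search_sql)
```

`FETCH APPROX FIRST n ROWS ONLY` is what routes the query through the HNSW index rather than an exact scan.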
Comparing with the Client-Side Pipeline
These pure server-side numbers confirm the network's share of our earlier client-side benchmarks (~73 ms embed + ~108 ms search = ~181 ms total):
- ~165 ms was network + client overhead (mostly round-trip latency from the West Coast to the OCI Phoenix region, likely 50–100 ms RTT, plus `oracledb` driver handling).
- Pure DB work accounts for only ~8–10% of what we measured from a local machine!
If we want to optimize further, we can co-locate the app with the ADB region (e.g., OCI Compute in the same region) for sub-100 ms end-to-end.
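The overhead split above is simple arithmetic on the figures measured in this post:

```python
# Back-of-envelope split of the earlier client-side benchmark
# (~73 ms embed + ~108 ms search) against the pure in-DB timings.
client_total_ms = 73 + 108        # measured from the local machine
pure_db_ms = 12.7 + 2.2           # in-DB embedding + HNSW search

overhead_ms = client_total_ms - pure_db_ms   # network + client handling
db_share = pure_db_ms / client_total_ms      # fraction spent in the DB

print(f"network/client overhead: ~{overhead_ms:.0f} ms")  # ~166 ms
print(f"pure DB share: ~{db_share:.0%}")                  # ~8%
```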
Comparing with pgvector/Chroma
As for local pgvector/Chroma (~1–5 ms search, ~29 ms local embed): Oracle matches or beats them on search once you normalize for free-tier, shared cloud hardware, and it wins on query embedding (<13 ms in-DB vs ~29 ms locally) because there is no data movement and no external call.
Latency Comparison Breakdown
For a dataset of 25k images, we have seen in the above tests how the switch from a “Local Python + Vector DB” setup to “In-Database AI” impacts latency:
| Phase | Local Pipeline (PyTorch / ONNX + pgvector/Chroma) | Oracle OADB 26ai (In‑Database ONNX + HNSW) | Technical Notes / Why OADB Wins |
|---|---|---|---|
| Embedding Generation | PyTorch CPU: ~130 ms<br>Local ONNX Runtime: ~30 ms | In‑Database ONNX Runtime: ~13 ms | OADB uses the same C++ ONNX Runtime engine we use locally. The speedup vs PyTorch comes from skipping Python + Torch overhead. |
| IPC / Data Transfer | 2–10 ms (Python → DB driver serialization → network hop → vector DB deserialization) | 0 ms | OADB executes embedding + vector search inside the kernel. No serialization, no network, no Python IPC. |
| Similarity Search (25k vectors) | 1–5 ms (pgvector/Chroma + network) | ~2 ms | For 25k vectors, HNSW search is sub‑millisecond on any modern CPU. OADB's HNSW index is implemented natively inside the SQL engine with shared‑memory access. |
| Total Latency | ~140 ms (PyTorch path)<br>~33–40 ms (local ONNX path) | ~15 ms | OADB removes Python overhead and network hops. Gains are largest when replacing PyTorch. |
| Overall Improvement | — | ~10× faster embedding vs PyTorch<br>2.3× faster embedding vs local ONNX<br>55–67% faster end‑to‑end vs local ONNX<br>~89% faster end‑to‑end vs PyTorch pipelines | The biggest win is eliminating Python + network + driver overhead. |
Massive End‑to‑End Latency Reduction
- ~89% faster end‑to‑end vs the existing PyTorch + pgvector/Chroma pipeline (133–145 ms → 15 ms)
- 55–67% faster end‑to‑end vs the local ONNX + pgvector/Chroma pipeline (33–45 ms → 15 ms)
Embedding Performance Gains
- 4.3× faster embedding by replacing PyTorch with local ONNX (130 ms → 30 ms)
- 2.3× faster embedding vs local ONNX (30 ms → 13 ms)
- 10× faster embedding vs PyTorch when using OADB's in‑database ONNX (130 ms → 13 ms)

Query Embedding Phase Gain: The primary gain here isn't the inference speed (ONNX is ONNX) but the elimination of context switching. In a local setup, the CPU must switch context between the Python app and the database process; in OADB, execution stays within the database's thread/process context.
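These headline ratios are straightforward to reproduce from the measured latencies:

```python
# Measured embedding latencies from this post's benchmarks (ms)
pytorch_ms, local_onnx_ms, in_db_ms = 130, 30, 13

print(f"PyTorch -> local ONNX: {pytorch_ms / local_onnx_ms:.1f}x")  # 4.3x
print(f"local ONNX -> in-DB:   {local_onnx_ms / in_db_ms:.1f}x")    # 2.3x
print(f"PyTorch -> in-DB:      {pytorch_ms / in_db_ms:.1f}x")       # 10.0x

# End-to-end reduction vs the PyTorch pipeline (~140 ms -> ~15 ms)
print(f"end-to-end reduction:  {1 - 15 / 140:.0%}")                 # 89%
```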
Search Phase Gain: In pgvector or Chroma, the search engine receives a "blob" or array of numbers via the SQL/gRPC protocol, which it must validate and parse. In OADB, the result of the `VECTOR_EMBEDDING` function is already a native `VECTOR` type in memory, ready for the `VECTOR_DISTANCE` operator.
Potential batch throughput gains
While the latency gain for a single query is modest (about 10–20 ms saved), the throughput difference for batch operations can be massive:
- Parallel Execution: OADB can parallelize the `VECTOR_EMBEDDING` function across all available OCPUs. In a local Python setup, you are often limited by Python's Global Interpreter Lock (GIL) or by the overhead of managing a multiprocessing pool to feed the DB.
- Pipelining: The database engine can pipeline the embedding of one query while the distance operator is searching for the previous one.
- Batch Ingestion: If we ever need to re-embed our 25k images (e.g., moving to a newer CLIP model), OADB can do this with a single `UPDATE` statement. This can be 3×–5× faster than a Python script that has to pull each image, embed it, and send the vector back to the DB over a network socket.
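A sketch of that single-statement re-embed, assuming a hypothetical `unsplash_images` table with an `image_blob` column and a hypothetical replacement image model `CLIP_VIT_B32_IMG` already loaded into the database, shown as the SQL text one would execute from `oracledb`:

```python
# Hypothetical names: unsplash_images, image_blob, embedding, CLIP_VIT_B32_IMG.
# Re-embeds every row in one statement; the database parallelizes the
# VECTOR_EMBEDDING calls across available OCPUs.
reembed_sql = """
    UPDATE unsplash_images
    SET    embedding = VECTOR_EMBEDDING(CLIP_VIT_B32_IMG
                                        USING image_blob AS DATA)
"""

# With a live connection:
#   cursor.execute(reembed_sql); connection.commit()
print(reembed_sql)
```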
Conclusion
In this post, I have scratched the surface of Oracle AI Autonomous Database by testing its in-memory ONNX runtime and integrated vector store. I've shown that it delivers impressive embedding and query performance by:
- Eliminating Python overhead, driver serialization, and network hops
- Running embedding + vector search inside the database kernel
- Providing a predictable, low‑variance latency ideal for real‑time search UX
In my next post, I will create a search frontend and test OADB with image and spatial queries. I am also curious about 26ai's native JSON support, which should make my web-scrape ingestion much faster, and I want to try Natural-Language-to-SQL via Select AI, which lets a user ask questions in English and have the database call a preconfigured LLM provider to generate SQL, or search the local vector index for RAG. Stay tuned!