
Using SQLite as Vector Store in PHP

At work, I'm building a plugin for DokuWiki that uses OpenAI's ChatGPT to answer questions by referencing the wiki pages.

The basic principle is described in this forum post: Experimental Plugin: AIChat - chat with your wiki pages using ChatGPT.

Basically: get embedding vectors for all your page data, use those to find the data that is semantically most similar to the question being asked, and use that as context for ChatGPT.

One of the challenges when working with embedding vectors in PHP is that you're limited in all directions. Your process is only allowed a certain amount of RAM and has to complete within a limited amount of time. Your language is limited to what is commonly available, and your users usually cannot install non-standard PHP extensions or additional software.

This means you cannot simply load one of the many available, efficient vector stores like FAISS, nor fire up a dedicated vector database.

So in my implementation I used SQLite as a simple vector store, storing the 1536-dimensional vectors as JSON blobs.

CREATE TABLE embeddings
(
    id        INTEGER   NOT NULL PRIMARY KEY,
    page      TEXT      NOT NULL,
    embedding BLOB      NOT NULL,
    chunk     TEXT      NOT NULL,
    created   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
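
For illustration, writing a chunk and its vector into that table could look roughly like the sketch below. The $pdo handle and the variable names are assumptions for this example, not necessarily the plugin's actual code.

// hypothetical sketch: store an embedding as a JSON blob via PDO
// assumes an open $pdo connection to the SQLite file and a float[] $embedding
$stmt = $pdo->prepare(
    'INSERT INTO embeddings (page, embedding, chunk) VALUES (:page, :embedding, :chunk)'
);
$stmt->execute([
    ':page'      => $pageId,
    ':embedding' => json_encode($embedding), // 1536 floats serialized as JSON
    ':chunk'     => $chunkText,
]);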

Since SQLite has no vector index for such blobs, a nearest-neighbor search using cosine similarity will always do a full table scan.

-- COSIM() is a PHP function registered as sqlite function
  SELECT *, COSIM(:query, embedding) AS similarity
    FROM embeddings
ORDER BY similarity DESC
   LIMIT :k
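
PDO's SQLite driver lets you expose a PHP callback as an SQL function via sqliteCreateFunction(). Here is a minimal sketch of what registering COSIM() could look like, assuming the vectors are stored as JSON; the callback body is illustrative, not the plugin's exact code.

// register COSIM() so it can be called from SQL
$pdo->sqliteCreateFunction('COSIM', function (string $a, string $b): float {
    // decode the JSON blobs back into float arrays
    $va = json_decode($a, true);
    $vb = json_decode($b, true);

    $dot = $normA = $normB = 0.0;
    foreach ($va as $i => $x) {
        $dot   += $x * $vb[$i];
        $normA += $x * $x;
        $normB += $vb[$i] * $vb[$i];
    }

    // cosine similarity: dot product divided by the product of the magnitudes
    return $dot / (sqrt($normA * $normB) ?: 1.0);
}, 2);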

In my first tests that was fine. But of course the performance depends heavily on how many rows are in your table. For a small number of rows (300), the nearest-neighbor search is quite fast at around 0.3s on my local machine. However, with about 14,000 rows, that time was already at 11s!

So when I talked to my girlfriend, who's an actual data scientist, about how to improve this behavior, she suggested clustering the data.

The idea is simple: when creating the embeddings index, we can check which vectors are close to each other. These vectors represent semantically similar topics, so we give them a shared cluster ID. Each cluster is stored in the database together with its center.

CREATE TABLE clusters
(
    cluster   INTEGER   NOT NULL PRIMARY KEY,
    centroid  BLOB      NOT NULL
);
 
ALTER TABLE embeddings
 ADD COLUMN cluster INTEGER REFERENCES clusters(cluster);

At query time we first check which cluster center is closest to the query, then do the cosine similarity search only within that cluster.

-- get the nearest cluster
  SELECT cluster 
    FROM clusters
ORDER BY COSIM(centroid, :query) DESC
   LIMIT 1
 
-- we now limit the search by cluster
  SELECT *, COSIM(:query, embedding) AS similarity
    FROM embeddings
   WHERE cluster = :cluster
ORDER BY similarity DESC
   LIMIT :k
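
Wired together with PDO, the two steps could look roughly like this sketch (variable names such as $queryJson and $k are assumptions):

// step 1: find the cluster whose centroid is most similar to the query vector
$stmt = $pdo->prepare(
    'SELECT cluster FROM clusters ORDER BY COSIM(centroid, :query) DESC LIMIT 1'
);
$stmt->execute([':query' => $queryJson]);
$cluster = (int) $stmt->fetchColumn();

// step 2: run the nearest-neighbor search only within that cluster
$stmt = $pdo->prepare(
    'SELECT *, COSIM(:query, embedding) AS similarity
       FROM embeddings
      WHERE cluster = :cluster
   ORDER BY similarity DESC
      LIMIT :k'
);
$stmt->bindValue(':query', $queryJson);
$stmt->bindValue(':cluster', $cluster, PDO::PARAM_INT);
$stmt->bindValue(':k', $k, PDO::PARAM_INT);
$stmt->execute();
$neighbors = $stmt->fetchAll(PDO::FETCH_ASSOC);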

But how do you actually create the clusters in the first place? This is where the k-means clustering algorithm comes into play. I used the bdelespierre/php-kmeans implementation for that.

You feed the algorithm your vectors and tell it to group them into a given number of clusters. However, during clustering the algorithm needs to keep all data in memory, so we cannot just feed it all our vectors. Instead, I use random sampling to create the initial clusters. It looks a bit like this, but in 1536 dimensions instead of two ;-). The blue dots are my random sample, the circles the 4 clusters.

Randomly selected vectors clustered
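
A rough sketch of the sampling and clustering step with bdelespierre/php-kmeans could look like this. The Space/addPoint()/solve() calls follow the library's documented usage; the random sampling via SQL and the centroid storage are assumptions for illustration.

// hypothetical sketch: cluster a random sample of embeddings with php-kmeans
$space = new KMeans\Space(1536); // one dimension per vector component

// pull a random sample straight from SQLite so we never load all vectors at once
$stmt = $pdo->query('SELECT id, embedding FROM embeddings ORDER BY RANDOM() LIMIT 2000');
foreach ($stmt as $row) {
    $space->addPoint(json_decode($row['embedding'], true), $row['id']);
}

// run k-means on the sample
$clusters = $space->solve($numClusters);

// store each centroid, using the same JSON format as the embeddings
foreach ($clusters as $i => $cluster) {
    $centroid = [];
    for ($d = 0; $d < 1536; $d++) {
        $centroid[] = $cluster[$d]; // clusters can be read like points/arrays
    }
    $stmt = $pdo->prepare('INSERT INTO clusters (cluster, centroid) VALUES (:id, :centroid)');
    $stmt->execute([':id' => $i, ':centroid' => json_encode($centroid)]);
}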

Once the sample has been clustered, I can simply use cosine similarity again to pick the nearest existing cluster for each vector in the database.

All vectors assigned to the nearest cluster
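
Because COSIM() is already registered, this assignment can even be expressed as a single SQL statement. A hedged sketch, not necessarily how the plugin actually does it:

// hypothetical: assign every embedding to the cluster with the most similar centroid
$pdo->exec(
    'UPDATE embeddings
        SET cluster = (
            SELECT c.cluster
              FROM clusters c
          ORDER BY COSIM(c.centroid, embeddings.embedding) DESC
             LIMIT 1
        )'
);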

Of course, the simplified two-dimensional example above isn't really ideal. The number of samples you pick and the number of clusters you create are important for the quality of the final results.

How many samples you pick is only limited by the amount of RAM you want to use while creating the clusters. I settled on 2,000 vectors, which results in about 120MB of peak memory in my script. But using more would always be better.

The number of clusters, on the other hand, determines how many vectors you have to search when doing the similarity query. So you might think having many clusters is best. However, creating more clusters not only increases the time the k-means algorithm runs, it may also limit your search space too much.

In my implementation I settled on calculating the number of clusters from the number of rows I have: I divide my row count by 400, i.e. aim for an average of 400 vectors per cluster.
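
In code that boils down to something like the following line; the rounding and the lower bound of one cluster are my own illustrative choices.

// aim for roughly 400 vectors per cluster, but never fewer than one cluster
$numClusters = max(1, (int) round($rowCount / 400));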

So far these values seem to work fine, but they are far from scientific. One approach that I have not tried so far is to use more but smaller clusters and then search the nearest two or three.

Anyway, if you're interested, feel free to look at the code of the SQLiteStorage class.

If you know of any other interesting approaches to do efficient k-nearest neighbor search on high-dimensional vectors in PHP, please let me know.

Tags:
dokuwiki, php, ai, chatgpt, vectorstore, sqlite