Vector Database

Build

To enable the feature, build with

-DWITH_FB_VECTORDB=1 -DWITH_OPENMP=$WITH_OPENMP -DWITH_OPENBLAS=$WITH_OPENBLAS

Vector/embedding data types

Vectors/embeddings can be represented as JSON or BLOB data types. BLOB data type is more efficient and is recommended.

JSON backed vector

A vector is stored as a JSON array with numbers. Each vector element takes 11 bytes (json_binary.h).

BLOB backed vector

A vector is stored as binary serialized floating point numbers. Each vector element takes 4 bytes.

Vector dimension attribute

A vector column must have a FB_VECTOR_DIMENSION attribute to specify the dimension of the vector. For example:

CREATE TABLE json_embeddings (
  embedding JSON NOT NULL FB_VECTOR_DIMENSION 4
}

CREATE TABLE blob_embeddings (
  embedding BLOB NOT NULL FB_VECTOR_DIMENSION 4
}

Populating vector columns

insert into json_embeddings values ('[1, 2, 3, 4]');

insert into blob_embeddings values (FB_VECTOR_JSON_TO_BLOB('[1, 2, 3, 4]'));

Vector functions

Vector metric functions

See MetricType and distances · facebookresearch/faiss Wiki for descriptions of L2, IP and Cosine vector metrics.

L2

FB_VECTOR_L2

select FB_VECTOR_L2('[1, 2, 3]', '[1, 2, 3]') as v1;

IP

FB_VECTOR_IP

select FB_VECTOR_IP('[1, 2, 3]', '[1, 2, 3]') as v1;

Cosine

There is no native function to compute cosine value
Normalize the vectors inserted to the table and the query input vector
- FB_VECTOR_NORMALIZE_L2 is a function to normalize a vector
Compute IP of the vectors

select FB_VECTOR_IP(FB_VECTOR_NORMALIZE_L2('[1, 2, 3]'), FB_VECTOR_NORMALIZE_L2('[2, 3, 5]')) as cosine;

Blob conversion functions

FB_VECTOR_JSON_TO_BLOB. Convert json array to BLOB.
FB_VECTOR_BLOB_TO_JSON. Convert blob backed vector to JSON.

insert into t1 values (1, FB_VECTOR_JSON_TO_BLOB('[1,2,3]'));

select FB_VECTOR_BLOB_TO_JSON(blob1) from t1;

Similarity search

# KNN by L2, k = 10. 
select id, name, FB_VECTOR_L2(vector1, '[10, 6, 5]') as dis from t1 ORDER BY dis  LIMIT 10;

# KNN by IP, k = 10. 
select id, name, FB_VECTOR_IP(vector1, '[10, 6, 5]') as dis from t1 ORDER BY dis DESC  LIMIT 10;

select id, name from t1 order by FB_VECTOR_L2(vector1, '[10, 6, 5]') LIMIT 10;

# query with distance filter
select id,name, FB_VECTOR_L2(vector1, '[1, 2, 3, 4]') as dis from t1 HAVING dis<10.24 order by dis limit 128;

ATTENTION! A similarity search query MUST have:

An order by clause with vector metric functions.
- The first parameter of the function should be the vector column.
- The second parameter of the function should be the query vector.
- The ordering direction should be ASC for L2 and DESC for IP.
A limit clause with the k value.

Vector index

You can create an index on a vector column to speed up vector similarity search performance (backed by Faiss indexes · facebookresearch/faiss Wiki · GitHub).

CREATE TABLE t1 (
  id BIGINT NOT NULL PRIMARY KEY,
  vector1 BLOB NOT NULL FB_VECTOR_DIMENSION 768,
  name varchar(64) collate utf8mb4_bin,
  INDEX key1(vector1) FB_VECTOR_INDEX_TYPE 'flat'
) ENGINE = rocksdb;

Vector columns should be defined with NOT NULL FB_VECTOR_DIMENSION {dimension}.
Only the “rocksdb” storage engine is supported.

Vector index syntax

The vector index can have one vector column.
You need to set vector index options:
- FB_VECTOR_INDEX_TYPE. Vector index type, required. Supported values: “flat”, “ivfflat” and “ivfpq”.
- FB_VECTOR_TRAINED_INDEX_TABLE. For IVF type vector indexes. The auxiliary table name used to store trained IVF index data. See details in next section.
- FB_VECTOR_TRAINED_INDEX_ID. For IVF type vector indexes. The trained index id.

IVF auxiliary table

IVF index requires an auxiliary table to store its configuration. The auxiliary table should have the schema like:

create table VECTORDB_DATA (
  id varchar(128) not null,
  type varchar(128) not null,
  seqno int not null,
  value JSON not null,
  primary key (id, type, seqno)
);

The auxiliary table should be created in the same database as the IVF index. Before creating an IVF index, we need to create the auxiliary table and populate it with the IVF index data.

IVF index data is stored in the auxiliary table as 3 types of records:

metadata. The type is set to "metadata". The seqno is set to 0. Only one metadata record is permitted per index. The value is a json object with the following fields:
- version. The version of the index data. Currently set to 1.
- nlist. number of centroids in the index.
- pq_m. number of subquantizers in the product quantizer.
- pq_nbits. number of bits per quantization index.
quantizer codes. The type is set to "quantizer". The value is a json array of float numbers. There could be multiple quantizer records per index. The records are ordered by seqno when they are read.
product quantizer codes. The type is set to "product_quantizer". The value is a json array of float numbers. There could be multiple product_quantizer records per index. The records are ordered by seqno when they are read. This is only applicable for IVF PQ index.

Example 1:
A IVF FLAT index with 2 centroids.

insert into VECTORDB_DATA values ('id1', 'metadata', 0, JSON_OBJECT('version', 1, 'nlist', 2));
insert into VECTORDB_DATA values ('id1', 'quantizer', 0, '[0, 0, 0, 0]'), ('id1', 'quantizer', 1, '[1, 1, 0, 0]');

CREATE TABLE t1 (
  id BIGINT NOT NULL PRIMARY KEY,
  vector1 JSON NOT NULL fb_vector_dimension 4,
  INDEX key1(vector1) FB_VECTOR_INDEX_TYPE 'ivfflat' FB_VECTOR_TRAINED_INDEX_TABLE 'VECTORDB_DATA' FB_VECTOR_TRAINED_INDEX_ID 'id1'
);

Example 2:

A IVF PQ index with 4 centroids. See code.

Add/modify IVF index config

Train an index using some existing vectors.
- Code samples: IVFFlat, IVFPQ
Convert the trained index to SQL insert statements that populates the auxiliary table.
Populate the auxiliary table with the insert statements. Make sure to choose a unique id field.
Issue ALTER TABLE ADD INDEX with the new id (drop the old index if it exists).

alter table t1 add index vector_key_1(vector1) FB_VECTOR_INDEX_TYPE 'ivfflat' FB_VECTOR_TRAINED_INDEX_TABLE 'VECTORDB_DATA' FB_VECTOR_TRAINED_INDEX_ID 'new_index_id';

Choose index type

A vector index speeds up similarity search, at a cost of storage space. If your query condition can limit the search space to a few K records using some other indexes, you can get good performance without having a vector index at all.

The table only has a few K records.
The query uses an existing index prefix that limits the search space. For instance, if your primary key is (collection_id, uuid), and in your query you always filter by collection_id, there are only a few thousand records matching a given collection_id.

When your query is slow with regular indexes:

The table has less than 100K records, choose ‘flat’ index.
When your query is slow with a ‘flat’ index and you have more than 100K records, consider the ‘ivfflat’ index.

A simple ‘flat’ index can scale to millions of vectors when the query uses the primary key prefix in the filtering condition. For instance, if your primary key is (collection_id, uuid), and in your query you always filter by collection_id. A ‘flat’ index should provide good performance when the records matching a given collection_id is less than 100K records. You can use this information to design your table schema to avoid the complexities of setting up an IVF index.

Query semantics

IVF indexes only support approximate nearest neighbor (ANN) searches. For an input vector, the server explores the nprobe number of lists instead of the full data set. The result from IVF indexes should be treated as approximate and incomplete.

VectorDB parameters

To see the current set of parameters:

> show variables like 'fb_vector%';
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| fb_vector_index_cond_pushdown     | ON    |
| fb_vector_index_cost_factor       | 1000  |
| fb_vector_max_dimension           | 4096  |
| fb_vector_min_dimension           | 3     |
| fb_vector_search_limit_multiplier | 10    |
| fb_vector_search_nprobe           | 16    |
| fb_vector_search_type             | AUTO  |
+-----------------------------------+-------+

fb_vector_search_limit_multiplier. The server fetches (fb_vector_search_limit_multiplier * limit) number of KNN records before executing filtering conditions.
fb_vector_search_nprobe. The number of centroids to explore in the search. This is only useful in IVF indexes. Default value 16.

select /*+ SET_VAR(fb_vector_search_limit_multiplier = 20) */ fb_vector_l2(vector1, '[2.5, 3.5, 2.5]') as dis from t1 where color='red' order by dis limit 10;

select /*+ SET_VAR(fb_vector_search_nprobe = 32) */ fb_vector_l2(vector1, '[0.3, 0.3]')  as dis from t1 force index(key1) order by dis limit 100;

Vector index stats

SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_VECTOR_INDEX;

# get detailed info about your index
set global rocksdb_table_stats_use_table_scan=on;
analyze table your_table_name;
set global rocksdb_table_stats_use_table_scan=default;
select * from information_schema.rocksdb_vector_index;

Documentation license here.

Installation

Build Steps

MyRocks

Overview
Transaction
Backup
- Logical backup with mysqldump
- Physical backup with myrocks_hotbackup
Performance Tuning
Monitoring
Migration
Internals
Vector Database

DocStore

Document column type
Document Path: a new way to query JSON data
Built-in Functions for JSON documents

MySQL/InnoDB Enhancements

Provide feedback

Saved searches

Use saved searches to filter your results more quickly