reindexer

Creation

Reindexer supports three types of vector indexes: hnsw (based on HNSWlib), ivf (based on FAISS IVF FLAT), and brute force (vec_bf), and three metrics for measuring vector similarity: inner_product, l2, and cosine.

For all vector index types, the metric and the dimension must be specified explicitly. Only vectors of the specified dimension can be inserted into and searched in the index.

Optionally, radius can be specified to filter vectors by metric value.

The initial size start_size can optionally be specified for brute force and hnsw indexes; it helps to avoid reallocation and reindexing. The optimal value equals the size of the fully filled index. A much larger start_size wastes memory, while a much smaller one slows down inserts. The minimum and default value is 1000.

Automatic embedding of vector indexes is also supported. This requires a configured vector-generation service: its URL and the base index fields are specified, the contents of the base fields are passed to the service, and the service returns the computed vector value.

HNSW options

The hnsw index is additionally configured with the following parameters:

IVF options

For the ivf index, the number of centroids must be specified: these are the vectors chosen to partition the entire set of vectors into clusters. Each vector belongs to the cluster of the centroid closest to it. The higher centroids_count is, the fewer vectors each cluster contains; this speeds up the search but slows down index building.
Required; the range of values is [1, 65536]. Recommended values are on the order of $4\sqrt{N}$ to $16\sqrt{N}$, where $N$ is the number of vectors in the index.
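For instance, the recommendation above can be turned into a small helper (a sketch only; the clamping to [1, 65536] follows the allowed range stated above):

```go
package main

import (
	"fmt"
	"math"
)

// centroidsRange returns the recommended centroids_count bounds
// (4*sqrt(n) .. 16*sqrt(n)) for an index expected to hold n vectors,
// clamped to the allowed [1, 65536] range.
func centroidsRange(n int) (lo, hi int) {
	s := math.Sqrt(float64(n))
	clamp := func(v float64) int {
		return int(math.Max(1, math.Min(65536, v)))
	}
	return clamp(4 * s), clamp(16 * s)
}

func main() {
	lo, hi := centroidsRange(1_000_000)
	fmt.Println(lo, hi) // 4000 16000
}
```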

Examples

// For a vector field, the data type must be array or slice of `float32`.
type Item struct {
	Id      int           `reindex:"id,,pk"`
	// In case of a slice, `dimension` should be explicitly specified in the field tag.
	VecBF   []float32     `reindex:"vec_bf,vec_bf,start_size=1000,metric=l2,dimension=1000"`
	// In case of an array, `dimension` is taken to be equal to the size of the array.
	VecHnsw [2048]float32 `reindex:"vec_hnsw,hnsw,m=16,ef_construction=200,start_size=1000,metric=inner_product,multithreading=1"`
	VecIvf  [1024]float32 `reindex:"vec_ivf,ivf,centroids_count=80,metric=cosine,radius=0.5"`
}

When adding a vector index to an existing namespace, the field on which the index will be built must be empty or contain an array of numbers of length equal to the dimension of the index.

ivfOpts := reindexer.FloatVectorIndexOpts{
	Metric:         "l2",
	Dimension:      1024,
	CentroidsCount: 32,
	Radius:         1e20,
}
indexDef := reindexer.IndexDef{
	Name:       "vec",
	JSONPaths:  []string{"vec"},
	IndexType:  "ivf",
	FieldType:  "float_vector",
	Config:     ivfOpts,
}
err := DB.AddIndex("ns_name", indexDef)
if err != nil {
	panic(err)
}

Embedding configuration

Reindexer is able to perform automatic remote HTTP API calls to receive embeddings for document fields or for strings in KNN query conditions. Currently, reindexer’s core simply sends the field/condition content to an external user service and expects to receive embedding results.

The embedding service has to implement this openapi spec.

Notice: the current embedding callback API is in beta and may change in future releases.

To configure automatic embedding, set the config field in the target vector index:

"config": {
  "embedding": {
    "upsert_embedder": {
      "name": <Embedder name>
      "URL": <URL service>,
      "cache_tag": <name, used to access the cache>,
      "fields": [ "idx1", "idx2" ]
      "embedding_strategy": <"always"|"empty_only"|"strict">
      "pool": {
        "connections": 10,
        "connect_timeout_ms": 300,
        "read_timeout_ms": 5000,
        "write_timeout_ms": 5000
      }
    },
    "query_embedder": {
      "name": <Embedder name>
      "URL": <URL service>,
      "cache_tag": <name, used to access the cache>,
      "pool": {
        "connections": 10,
        "connect_timeout_ms": 300,
        "read_timeout_ms": 5000,
        "write_timeout_ms": 5000
      }
    }
  }
}

A connection pool for the embedder can also be configured via the optional pool object shown above.

The upsert embedder is used in Insert/Update/Upsert operations; it sends requests as JSON to /api/v1/embedder/NAME/produce?format=json. The query embedder is triggered by WhereKNN and sends a string as the search value (?format=text). The embedding process sends the JSON values of all fields involved in the embedding to the specified URL. For one requested vector:

{"data":[{"field0":val0, "field1":val1, "field2":[val20, val21, ...], ...}]

Or a batch for several:

{"data":[{"field0":val0, "field1":[val10, val11], ...}, ..., {"field0":val0, "field1":[], ...}]}

The query embedder uses the same format for a single string and for a batch of several:

{"data":["WhereKNN input search text", ...]}

For details on request/response formats, see the openapi spec. In response, produce should always return an array of arrays of objects, to support subsequent chunking:

[
  // One inner array corresponds to one object/string that came to produce
  [
    // One object corresponds to one chunk. At this stage, there should always
    // be one chunk; the data in the 'chunk' field itself is ignored - only the
    // vector from the 'embedding' field is used
    {
      "chunk": "some data",
      "embedding": [ 1.1, 0.7, ...]
    },
    {
      "chunk": "more data",
      "embedding": [ 0.1, -1.0, ...]
    },
    ...
  ],
  ...
]
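As a sketch of the whole round trip, the following stub service (via net/http/httptest) accepts a query-embedder request and returns a response in the shape described above. The embedder name, URL path, and vector values are illustrative assumptions, not a real embedding service:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// chunk mirrors one element of the inner arrays in a produce response.
type chunk struct {
	Chunk     string    `json:"chunk"`
	Embedding []float32 `json:"embedding"`
}

// roundTrip posts one search string to a stub embedder and decodes the reply.
func roundTrip() [][]chunk {
	// Stub for the external embedding service: one inner array per input
	// string, one chunk each, with a fixed illustrative 3-component vector.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Data []string `json:"data"`
		}
		json.NewDecoder(r.Body).Decode(&req)
		resp := make([][]chunk, len(req.Data))
		for i, s := range req.Data {
			resp[i] = []chunk{{Chunk: s, Embedding: []float32{0.1, 0.2, 0.3}}}
		}
		json.NewEncoder(w).Encode(resp)
	}))
	defer srv.Close()

	// Query-embedder request body: an array of search strings.
	body, _ := json.Marshal(map[string][]string{"data": {"WhereKNN input search text"}})
	httpResp, err := http.Post(srv.URL+"/api/v1/embedder/my_embedder/produce?format=text",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer httpResp.Body.Close()

	var out [][]chunk
	json.NewDecoder(httpResp.Body).Decode(&out)
	return out
}

func main() {
	out := roundTrip()
	fmt.Println(len(out), len(out[0]), out[0][0].Embedding) // 1 1 [0.1 0.2 0.3]
}
```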

Embedding cache configuration

Caching embedding results can improve performance, and the caches need to be configured. This is done in two places.

First, the cache_tag parameter described above, part of the config field in the target vector index description. cache_tag is a simple name/identifier used to access the cache. It is optional; if not specified, caching is not used. The name does not have to be unique: different embedders may put their results into the same cache. Keep in mind that this only works well if the embedders' source data do not overlap, or if the embedders return exactly the same values for the same request.

Second, a special item in the system #config namespace, with type embedders. It is optional; if not specified, caching is disabled for all embedders.

{
  "type":"embedders",
  "caches":[
    {
      "cache_tag":"*",
      "max_cache_items":1000000,
      "hit_to_cache":1
    },
    {
      "cache_tag":"the jungle book",
      "max_cache_items":2025,
      "hit_to_cache":3
    }
  ]
}

Quantization Configuration for HNSW Index

The HNSW index supports vector quantization. Quantization helps reduce the memory footprint of the index (by approximately 75%) and lowers the computational cost of search. At the moment, only 8-bit scalar quantization is supported.

Quantization is configured via the quantization_config object in the index definition under the config field and is currently available only for HNSW indexes.

Configuration Example

{
  "indexes": [
    {
      "name": "some_hnsw_index",
      ///...
      "config": {
        "dimension": 1024,
        "metric": "cosine",
        ///...
        "quantization_config": {
          "quantization_type": "scalar_quantization_8_bit",
          "quantile": 0.99,
          "sample_size": 30000,
          "quantization_threshold": 150000
        },
        "embedding": {
          ///...
        }
      }
    }
  ]
}

quantization_config Fields

| Field | Type | Default Value | Description |
|---|---|---|---|
| quantization_type | string | | Quantization type. Currently, only the scalar_quantization_8_bit value is supported, which corresponds to 8-bit scalar quantization. If the quantization_config object is absent from the index configuration, quantization is disabled |
| quantile | number | computed automatically | Quantile used to determine the clipping range of vector components before quantization. Allowed values range from 0.95 to 1.0. The default value is computed automatically based on the dimensionality of vectors in the index. It is recommended to change this parameter only if the distribution of vector component values is known and additional search-quality tuning is required, for example to achieve the expected recall |
| sample_size | integer | 20000 | Number of vectors sampled from the index to build a sample used for estimating the quantization range. Minimum and maximum component values are computed on this sample with quantile taken into account, after which the effective quantization range is determined |
| quantization_threshold | integer | 100000 | Minimum number of vectors in the index required to trigger background quantization |

Default Behavior

To enable quantization, it is enough to specify only quantization_type (tag reindex:"quantization=sq8" in the Go connector):

{
  "quantization_config": {
    "quantization_type": "scalar_quantization_8_bit"
  }
}
type Item struct {
	Id      int           `reindex:"id,,pk"`
	VecHnsw [1024]float32 `reindex:"hnsw_idx,hnsw,m=16,ef_construction=200,metric=inner_product,multithreading=1,start_size=1000,quantization=sq8"`
}

All other parameters will use their default values.

Updating and Disabling Quantization

The user can update the parameters of an existing quantization config (quantile, sample_size, quantization_threshold), but only until the index has been quantized:

err := DB.UpdateIndex(ns, reindexer.IndexDef{
  Name:      hnswIndexName,
  IndexType: "hnsw",
  FieldType: "float_vector",
  ///...
  Config: reindexer.FloatVectorIndexOpts{
    ///...
    QuantizationConfig: &bindings.QuantizationConfig{
      Type:       "scalar_quantization_8_bit",
      Quantile:   0.987,
      SampleSize: 200000,
      Threshold:  1000000,
    },
  },
})

Otherwise, an error will be returned, and the new configuration can only be applied after resetting the old one.

The absence of quantization_config means the index operates without quantization.
To disable quantization, the index must be updated by removing the quantization_config object from its configuration.

If quantization has already been performed for this index, removing this section is treated as resetting the quantization configuration and reloading the original float vector values.

Float vector fields in selection results

By default, float vector fields are excluded from the results of all queries to namespaces containing vector indexes. If float vector fields are needed, this must be requested explicitly in the query: either by listing the required fields, or by requesting all vectors via vectors().

Supported filtering operations on float vector fields are KNN, Empty, and Any. Multiple KNN filters cannot be used in one query, and KNN cannot be combined with fulltext filtering.

The parameters of a KNN query depend on the specific index type. For every index type, k or radius (or both) must be specified. k is the maximum number of documents returned from the index for subsequent filtering.

In addition to the parameter k (or instead of it), query results can be filtered by rank value using the radius parameter. It is named so because, under the L2 metric, it restricts the vectors in the query result to a sphere of the specified radius.

Note: in the case of the L2 metric, for performance reasons the ranks of vectors in the query result are actually squared distances to the query vector. Thus, although the parameter is called radius, the passed value is interpreted as a squared distance.
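The rank computation under L2 can be sketched as follows (an illustration of the squared-distance convention, not Reindexer internals):

```go
package main

import "fmt"

// l2Rank computes the rank value reported under the L2 metric: the squared
// Euclidean distance between the query vector and a document vector. A
// radius of r therefore keeps vectors whose true distance to the query is
// at most sqrt(r).
func l2Rank(query, doc []float32) float32 {
	var s float32
	for i := range query {
		d := query[i] - doc[i]
		s += d * d
	}
	return s
}

func main() {
	q := []float32{0, 0}
	fmt.Println(l2Rank(q, []float32{3, 4})) // distance 5, rank (squared) 25
}
```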

When searching an hnsw index, you can additionally specify the ef parameter. Increasing it yields a higher-quality result (better recall) but slows down the search. See the description here. Optional; the minimum and default value is k.

When searching an ivf index, you can additionally specify the nprobe parameter: the number of clusters inspected during the search. Increasing it yields a higher-quality result (better recall) but slows down the search. Optional; must be greater than 0, default value is 1.

KNN search with auto-embedding

KNN search with automatic embedding works much like plain KNN search, with one exception: it expects a string instead of a vector. The string is sent to the embedding service, which returns a computed vector, and that vector is then used as the actual value for filtering. The query_embedder must be configured beforehand.

// hnsw
hnswSearchParams, err := reindexer.NewIndexHnswSearchParam(100000, knnBaseSearchParams)
if err != nil {
	panic(err)
}
it := db.Query("test_ns").WhereKnnString("vec_hnsw", "<text to embed>", hnswSearchParams).Exec()
defer it.Close()

// ivf
ivfSearchParams, err := reindexer.NewIndexIvfSearchParam(10, knnBaseSearchParams)
if err != nil {
	panic(err)
}
it := db.Query("test_ns").WhereKnnString("vec_ivf", "<text to embed>", ivfSearchParams).Exec()
defer it.Close()

HTTP
```http request
curl --location --request POST 'http://127.0.0.1:9088/api/v1/db/vectors_db/query' \
--header 'Content-Type: application/json' \
--data-raw '{
  "namespace": "test_ns",
  "type": "select",
  "filters": [
    {
      "op": "and",
      "cond": "knn",
      "field": "vec_bf",
      "value": "<text to calculate embedding for>",
      "params": {"k": 100}
    }
  ]
}'
```

Rank

By default, the results of queries with KNN are sorted by rank, which equals the requested metric value: for indexes with the l2 metric, from lower to higher; for the inner_product and cosine metrics, from higher to lower. This corresponds to the best match for each metric.
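The ordering convention can be illustrated with a small sketch (the rank computation here is illustrative, not Reindexer internals):

```go
package main

import (
	"fmt"
	"sort"
)

// innerProduct computes the rank value for inner_product-metric indexes:
// a higher value means a better match, so results are ordered descending.
func innerProduct(a, b []float32) float32 {
	var s float32
	for i := range a {
		s += a[i] * b[i]
	}
	return s
}

func main() {
	query := []float32{1, 0}
	docs := [][]float32{{2, 5}, {3, 1}, {0, 9}}
	ranks := make([]float32, len(docs))
	for i, d := range docs {
		ranks[i] = innerProduct(query, d)
	}
	// Descending order = best match first for inner_product/cosine;
	// an l2 index would instead sort its squared distances ascending.
	sort.Slice(ranks, func(i, j int) bool { return ranks[i] > ranks[j] })
	fmt.Println(ranks) // [3 2 0]
}
```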

When the rank value of each document in the query result is needed, it must be requested explicitly via the RANK() function in SQL or WithRank() in Go:

SELECT *, RANK() FROM test_ns WHERE KNN(vec_bf, [2.4, 3.5, ...], k=200)
knnBaseSearchParams := reindexer.BaseKnnSearchParam{}.SetK(200)
db.Query("test_ns").WithRank().WhereKnn("vec_bf", []float32{2.4, 3.5, ...}, knnBaseSearchParams)

Result:

{"id": 0, "rank()": 1.245}

Rank can also be used in sort expressions; this is described in detail here.

Query examples

Environment variables affecting vector indexes

Additional action commands

These commands can be used by inserting them via upsert into the #config namespace.

Rebuilding clusters for IVF index

{"type":"action","action":{"command":"rebuild_ivf_index", "namespace":"*", "index":"*", "data_part": 0.5}}

The command can be useful for cases where the composition of vectors in the index has changed significantly and the current centroids do not provide sufficiently high-quality output.

Removing disk cache for ANN indexes

{"type":"action","action":{"command":"drop_ann_storage_cache", "namespace":"*", "index":"*"}}

The command can be useful for cases when you need to force the re-creation of the disk cache for ANN indexes, or disable it completely (using it together with the RX_DISABLE_ANN_CACHE environment variable).

Create embedding for existing documents

Removing disk cache for embedders