
Sparse Vector

Sparse vectors are an important method of data representation in information retrieval and natural language processing. While dense vectors are popular for their excellent semantic understanding capabilities, sparse vectors often provide more accurate results when it comes to applications that require precise matching of keywords or phrases.

Overview

A sparse vector is a special representation of high-dimensional vectors where most elements are zero, and only a few dimensions have non-zero values. As shown in the diagram below, dense vectors are typically represented as continuous arrays where each position has a value (e.g., [0.3, 0.8, 0.2, 0.3, 0.1]). In contrast, sparse vectors store only non-zero elements and their indices, often represented as key-value pairs (e.g., [{2: 0.2}, ..., {9997: 0.5}, {9999: 0.7}]).

[Diagram: dense vector vs. sparse vector representation]
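For concreteness, the short sketch below (plain Python, independent of Zilliz Cloud) shows the two representations side by side; the 10,000-dimensional space is an assumption for illustration:

# Dense representation: every dimension holds a value.
dense_vector = [0.3, 0.8, 0.2, 0.3, 0.1]

# Sparse representation: only non-zero dimensions are stored as {index: value}.
# Assumes a hypothetical 10,000-dimensional space; unlisted dimensions are zero.
sparse_vector = {2: 0.2, 9997: 0.5, 9999: 0.7}

# Expanding the sparse form back into a dense list (for illustration only).
dim = 10000
expanded = [sparse_vector.get(i, 0.0) for i in range(dim)]
print(len(expanded), expanded[2], expanded[9999])  # 10000 0.2 0.7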

Sparse vectors reduce storage requirements and boost computational efficiency for high-dimensional data, making them ideal for large, sparse datasets. Common use cases include:

  • Text analysis: Representing documents as bag-of-words vectors, where each dimension corresponds to a word in the vocabulary, and only words present in the document have non-zero values.

  • Recommendation systems: User-item interaction matrices, where each dimension represents a user’s rating for a particular item, with most users interacting with only a few items.

  • Image processing: Local feature representation, focusing only on key points in the image, resulting in high-dimensional sparse vectors.

Sparse vectors are commonly generated through traditional statistical methods like TF-IDF and BM25, or via neural models that learn sparse representations from text. Zilliz Cloud supports full-text search for text data using the BM25 algorithm. It automatically converts text into sparse embeddings — no manual vectorization required. Refer to Full Text Search for more information.
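As an illustration of the traditional approach, the hedged sketch below uses scikit-learn's TfidfVectorizer (an assumption; it is not part of Zilliz Cloud or pymilvus) to turn two short documents into sparse TF-IDF vectors. With the built-in BM25 function described above, this step is unnecessary:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "information retrieval is a field of study.",
    "information retrieval focuses on finding relevant information in large datasets.",
]

# Fit a TF-IDF model; the result is a scipy.sparse matrix with one row per document.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Only the terms that appear in a document have non-zero weights.
print(tfidf_matrix.getrow(0))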

Once vectorized, the data can be stored in Zilliz Cloud for efficient management and retrieval. The diagram below illustrates the overall process.

[Diagram: text is converted into sparse embeddings and stored in Zilliz Cloud for retrieval]

📘Notes

In addition to sparse vectors, Zilliz Cloud also supports dense vectors and binary vectors. Dense vectors are ideal for capturing deep semantic relationships, while binary vectors excel in scenarios like quick similarity comparisons and content deduplication. For more information, refer to Dense Vector and Binary Vector.

Data Formats

Zilliz Cloud supports representing sparse vectors in any of the following formats:

  • Sparse Matrix (using a scipy.sparse class such as csr_matrix)

    from scipy.sparse import csr_matrix

    # Create a sparse matrix
    row = [0, 0, 1, 2, 2, 2]
    col = [0, 2, 2, 0, 1, 2]
    data = [1, 2, 3, 4, 5, 6]
    sparse_matrix = csr_matrix((data, (row, col)), shape=(3, 3))

    # Represent sparse vector using the sparse matrix
    sparse_vector = sparse_matrix.getrow(0)
  • List of Dictionaries (formatted as {dimension_index: value, ...})

    # Represent sparse vector using a dictionary
    sparse_vector = [{1: 0.5, 100: 0.3, 500: 0.8, 1024: 0.2, 5000: 0.6}]
  • List of Tuple Iterators (formatted as [(dimension_index, value)])

    # Represent sparse vector using a list of tuples
    sparse_vector = [[(1, 0.5), (100, 0.3), (500, 0.8), (1024, 0.2), (5000, 0.6)]]

Define Collection Schema

Before creating a collection, you need to define the collection schema, which includes the data fields and, if needed, the functions that convert a text field into its sparse vector representation.

Add fields

To use sparse vectors in Zilliz Cloud clusters, you need to create a collection with a schema including at least the following fields:

  • A SPARSE_FLOAT_VECTOR field reserved for storing sparse embeddings, either auto-generated from a VARCHAR field or provided directly by the inserted data.

  • (For built-in BM25) A VARCHAR field for raw text documents with enable_analyzer set to True.

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)

In this example, three fields are added:

  • pk: This field stores primary keys using the VARCHAR data type. Values are auto-generated, with a maximum length of 100 bytes.

  • sparse_vector: This field stores sparse vectors using the SPARSE_FLOAT_VECTOR data type.

  • text: This field stores text strings using the VARCHAR data type, with a maximum length of 1000 bytes.

Add functions

📘Notes

This step is required if you need Zilliz Cloud to generate sparse vector embeddings based on the value in a specified text field during data insertion. You can skip this step if you decide to bring your own vectors.

To use the full-text search feature built into Zilliz Cloud, which is powered by BM25 and eliminates the need to generate sparse embeddings manually, add the Function to the schema:

from pymilvus import Function, FunctionType

bm25_function = Function(
    name="text_bm25_emb",
    input_field_names=["text"],
    output_field_names=["sparse_vector"],
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

For more details, see Full Text Search.

Set Index Parameters

The process of creating an index for sparse vectors is similar to that for dense vectors, but with differences in the specified index type (index_type), distance metric (metric_type), and index parameters (params).

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_auto_index",
    index_type="AUTOINDEX",
    metric_type="BM25",  # or "IP" for custom sparse vectors
)

This example uses the AUTOINDEX index type with BM25 as the metric type, which matches sparse vectors generated by the built-in BM25 function. If you provide your own sparse vectors, use IP as the metric type instead, as shown in the sketch below.
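If you bring your own sparse vectors rather than relying on the built-in BM25 function, only the metric type changes. The following is a minimal sketch under that assumption, reusing the field and index names from the example above:

# Index parameters for self-provided sparse vectors (assumption: the BM25
# function is not used, so IP replaces BM25 as the metric type).
index_params_custom = client.prepare_index_params()

index_params_custom.add_index(
    field_name="sparse_vector",
    index_name="sparse_auto_index",
    index_type="AUTOINDEX",
    metric_type="IP",
)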

Create Collection

Once the sparse vector and index settings are complete, you can create a collection that contains sparse vectors. The example below uses the create_collection method to create a collection named my_collection.

client.create_collection(
    collection_name="my_collection",
    schema=schema,
    index_params=index_params
)
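As a quick sanity check, you can inspect the newly created collection. The line below is a minimal sketch using the same client instance:

# Verify that the collection and its sparse vector field were created as expected.
print(client.describe_collection(collection_name="my_collection"))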

Insert Data

You must provide data for all fields defined during collection creation, except for fields that are auto-generated (such as the primary key with auto_id enabled). If you are using the built-in BM25 function to auto-generate sparse vectors, you should also omit the sparse vector field when inserting data.

data = [
    {
        "text": "information retrieval is a field of study.",
        # "sparse_vector": {1: 0.5, 100: 0.3, 500: 0.8} # Do NOT provide sparse vectors if using built-in BM25
    },
    {
        "text": "information retrieval focuses on finding relevant information in large datasets.",
        # "sparse_vector": {10: 0.1, 200: 0.7, 1000: 0.9} # Do NOT provide sparse vectors if using built-in BM25
    },
]

client.insert(
    collection_name="my_collection",
    data=data
)
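If you bring your own sparse embeddings instead of using the built-in BM25 function, supply the sparse_vector field explicitly in the dictionary format shown earlier. The following is a hedged sketch that assumes the vectors were generated outside Zilliz Cloud:

# Insert with self-provided sparse vectors (assumption: the schema has no BM25
# function, so the sparse_vector field must be supplied explicitly).
custom_data = [
    {
        "text": "information retrieval is a field of study.",
        "sparse_vector": {1: 0.5, 100: 0.3, 500: 0.8},
    },
    {
        "text": "information retrieval focuses on finding relevant information in large datasets.",
        "sparse_vector": {10: 0.1, 200: 0.7, 1000: 0.9},
    },
]

client.insert(collection_name="my_collection", data=custom_data)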

Perform Similarity Search

To perform a similarity search using sparse vectors, prepare both the query data and the search parameters. If you are using the built-in BM25 function, simply provide the query text; there is no need to supply a sparse vector.

# Prepare search parameters
search_params = {
    "params": {"drop_ratio_search": 0.2},  # A tunable drop ratio parameter with a valid range between 0 and 1
}

# Query with text when searching with the built-in BM25 function
query_data = ["What is information retrieval?"]

# Otherwise, query with the sparse vector
# query_data = [{1: 0.2, 50: 0.4, 1000: 0.7}]

Then, execute the similarity search using the search method:

res = client.search(
    collection_name="my_collection",
    data=query_data,
    limit=3,
    output_fields=["pk"],
    search_params=search_params,
)

print(res)

# Output
# data: ["[{'id': '453718927992172266', 'distance': 0.6299999952316284, 'entity': {'pk': '453718927992172266'}}, {'id': '453718927992172265', 'distance': 0.10000000149011612, 'entity': {'pk': '453718927992172265'}}]"]
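To work with the results programmatically rather than printing them, you can iterate over the hits. The snippet below is a minimal sketch that assumes the list-of-hits layout shown in the sample output above:

# Iterate over the hits returned for each query (assumes the result layout shown above).
for hits in res:
    for hit in hits:
        print(hit["id"], hit["distance"], hit["entity"]["pk"])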

For more information on similarity search parameters, refer to Basic Vector Search.