
BM25 Function

The BM25 function enables full text search by transforming raw text into sparse vectors and scoring documents based on lexical relevance. It applies term-based matching and frequency-aware weighting to support efficient retrieval of text documents that closely match query terms.

As a local text function, the BM25 function runs within Zilliz Cloud and does not require model inference or external integrations. It provides a deterministic and transparent retrieval mechanism for text-based search scenarios.

How BM25 works

BM25 is a term-based relevance scoring algorithm widely used in full text retrieval. In Zilliz Cloud, BM25 is implemented as a sparse retrieval pipeline that converts text into term-weight representations and retrieves the top-K documents using distributed sparse indexes.

The overall workflow consists of two symmetric paths: document ingestion and query text processing, which share the same text analysis logic.

Document ingestion: From text to sparse representation

When a document is inserted, its raw text is first processed by an analyzer, which tokenizes the text into individual terms.

For example, the document:

"We are loving Milvus!"

can be analyzed into the following terms:

["we", "love", "milvus"]

Each document is then represented as a term frequency (TF) representation, which records how many times each term appears in the document. For example:

{
"we": 1,
"love": 1,
"milvus": 1
}
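
The analysis and term-frequency steps can be sketched in a few lines of Python. This is a toy illustration only: `analyze`, `term_frequencies`, `STOP_WORDS`, and `STEMS` are hypothetical names, and the hand-written stop-word list and stem table merely stand in for whatever analyzer is configured on the text field. They are not the built-in Zilliz Cloud analyzer.

```python
import re
from collections import Counter

# Hypothetical stand-ins for a real analyzer's stop-word filter and stemmer.
STOP_WORDS = {"are", "is", "the", "a"}
STEMS = {"loving": "love", "loves": "love"}


def analyze(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stop words, apply toy stems."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [STEMS.get(token, token) for token in tokens if token not in STOP_WORDS]


def term_frequencies(text: str) -> dict[str, int]:
    """Count how many times each analyzed term appears in the text."""
    return dict(Counter(analyze(text)))


print(analyze("We are loving Milvus!"))
print(term_frequencies("We are loving Milvus!"))
```

Because the same `analyze` function would be applied to queries, a query such as "who loves Milvus?" reduces to terms that can match the document's `love` and `milvus` entries.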

At the same time, Zilliz Cloud updates corpus-level statistics, including:

  • the document frequency (DF) of each term

  • the average document length

  • posting lists that map each term to the documents containing it

The document's TF representation is stored as a sparse embedding and added to the sparse index, where term postings are partitioned across nodes for scalable retrieval.
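
To make the corpus-level statistics concrete, the following is a minimal single-process sketch of the bookkeeping described above. The `CorpusStats` class and its method names are hypothetical and exist only for illustration; the actual Zilliz Cloud implementation is distributed and is not described in this detail here.

```python
from collections import defaultdict


class CorpusStats:
    """Toy bookkeeping of DF, average document length, and posting lists."""

    def __init__(self):
        self.doc_count = 0
        self.total_length = 0
        # term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, term_freqs):
        """Register one document given its term-frequency map."""
        self.doc_count += 1
        self.total_length += sum(term_freqs.values())
        for term, tf in term_freqs.items():
            self.postings[term][doc_id] = tf

    def document_frequency(self, term):
        """Number of documents that contain the term."""
        return len(self.postings.get(term, {}))

    @property
    def average_document_length(self):
        """Mean number of terms per document (0.0 for an empty corpus)."""
        return self.total_length / self.doc_count if self.doc_count else 0.0


stats = CorpusStats()
stats.add_document(1, {"we": 1, "love": 1, "milvus": 1})
stats.add_document(2, {"milvus": 2, "search": 1})
print(stats.document_frequency("milvus"))
print(stats.average_document_length)
```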

Query text processing: Apply IDF weighting

When a text-based query is issued, it is processed by the same analyzer used during document ingestion, ensuring consistent term segmentation.

For example, the query:

"who loves Milvus?"

can be analyzed into:

["who", "love", "milvus"]

For each query term, Zilliz Cloud looks up its inverse document frequency (IDF) from corpus statistics. IDF reflects how informative a term is across the entire dataset: rarer terms receive higher weights, while common terms receive lower weights.

Conceptually, this produces a set of IDF-weighted query terms, such as:

{
"who": 0.1,
"love": 0.5,
"milvus": 1.2
}
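
As a rough sketch of how such weights could be derived, the snippet below uses one widely used BM25 IDF formulation, ln(1 + (N - df + 0.5) / (df + 0.5)), where N is the number of documents and df is the term's document frequency. Which exact variant Zilliz Cloud applies is not stated here, so treat the formula as an assumption. The function names and the document-frequency numbers are made up for illustration, and the resulting weights are not meant to reproduce the example values above.

```python
import math


def idf(doc_count: int, doc_freq: int) -> float:
    """A common BM25 IDF variant; always non-negative."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def weight_query(terms, doc_freqs, doc_count):
    """Map each query term to its IDF weight (unseen terms get df = 0)."""
    return {term: idf(doc_count, doc_freqs.get(term, 0)) for term in terms}


# Made-up document frequencies for a hypothetical corpus of 1,000 documents.
doc_freqs = {"who": 900, "love": 300, "milvus": 20}
weights = weight_query(["who", "love", "milvus"], doc_freqs, doc_count=1000)
print(weights)
```

The rarer a term is in the corpus, the larger its weight: here `milvus` outweighs `love`, which in turn outweighs `who`.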

BM25 scoring and top-K retrieval

BM25 ranks documents by computing a relevance score based on matched query terms. Scoring is performed at the term level and aggregated at the document level.

Term-level scoring

For each query term that appears in a document, BM25 computes a term-level score:

term_score =
IDF(term) ×
TF_boost(term, document, k1) ×
length_normalization(document, b)

Where:

  • IDF(term) reflects how rare the term is in the collection

  • TF_boost(…, k1) increases with term frequency but saturates as frequency grows

  • length_normalization(…, b) adjusts the score based on document length
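
For reference, the textbook BM25 formula folds length normalization into the denominator of the TF term rather than applying it as a separate multiplier. The sketch below follows that textbook form. The function name and the defaults k1 = 1.2 and b = 0.75 are assumptions (they are common defaults, not values confirmed for Zilliz Cloud).

```python
def bm25_term_score(idf, tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of a single matched term.

    idf          -- inverse document frequency weight of the term
    tf           -- number of times the term occurs in the document
    doc_len      -- length of the document in terms
    avg_doc_len  -- average document length across the corpus
    k1           -- controls how quickly the TF boost saturates
    b            -- controls how strongly document length is penalized (0..1)
    """
    # Longer-than-average documents get norm > 1, which lowers the score.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Same term and frequency, different document lengths.
print(bm25_term_score(idf=1.2, tf=1, doc_len=10, avg_doc_len=10))
print(bm25_term_score(idf=1.2, tf=1, doc_len=40, avg_doc_len=10))
# Growing term frequency raises the score, but it stays below idf * (k1 + 1).
print(bm25_term_score(idf=1.2, tf=100, doc_len=10, avg_doc_len=10))
```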

Document-level scoring and top-K retrieval

The final document score is the sum of term-level scores for all matched query terms:

document_score =
sum of term_score over all matched query terms

Documents are ranked by their final scores, and the top-K highest-scoring documents are returned.
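
Putting the pieces together, a brute-force version of document-level scoring and top-K selection might look like the sketch below. It assumes the textbook BM25 term formula with k1 = 1.2 and b = 0.75 and scans every document, whereas a real system would use posting lists and a sparse index. `bm25_rank` and the sample documents are hypothetical; the query weights reuse the illustrative IDF values from earlier.

```python
import heapq


def bm25_rank(query_weights, doc_term_freqs, k, k1=1.2, b=0.75):
    """Return the k best (doc_id, score) pairs, highest score first."""
    doc_lengths = {d: sum(tfs.values()) for d, tfs in doc_term_freqs.items()}
    avg_len = sum(doc_lengths.values()) / len(doc_lengths)
    scores = {}
    for doc_id, tfs in doc_term_freqs.items():
        norm = 1.0 - b + b * doc_lengths[doc_id] / avg_len
        score = 0.0
        for term, idf in query_weights.items():
            tf = tfs.get(term, 0)
            if tf > 0:  # only matched query terms contribute
                score += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
        if score > 0.0:
            scores[doc_id] = score
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


query_weights = {"who": 0.1, "love": 0.5, "milvus": 1.2}
docs = {
    1: {"we": 1, "love": 1, "milvus": 1},
    2: {"milvus": 1, "is": 1, "fast": 1},
    3: {"who": 1, "knows": 1},
}
top = bm25_rank(query_weights, docs, k=2)
print(top)
```

Document 1 matches two query terms and ranks first, document 2 matches only `milvus`, and document 3 matches only the low-weight term `who`, so it falls outside the top 2.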

Before you start

Before using the BM25 function, plan your collection schema to ensure it supports lexical full text search:

  • A text field for raw content

    Your collection must include a VARCHAR field to store raw text. This field is the source of text that will be processed for full text search.

  • An analyzer for the text field

    The text field must have an analyzer enabled. The analyzer defines how text is tokenized and normalized before lexical relevance is computed by the BM25 function.

    By default, Zilliz Cloud provides a built-in analyzer that tokenizes text based on whitespace and punctuation. If your application requires custom tokenization or normalization behavior, you can define a custom analyzer. See Choose the Right Analyzer for Your Use Case for details.

  • A sparse vector for BM25 output

    Your collection must include a SPARSE_FLOAT_VECTOR field to store the sparse representations generated by the BM25 function. This field is used for indexing and retrieval during full text search.

Once your schema meets these requirements, proceed to create the collection and use the BM25 function.

Step 1: Create a collection with a BM25 function

To use the BM25 function, you must define it when creating the collection. The function becomes part of the collection schema and is applied automatically during data insertion and search.

Via SDK

Define schema fields

Your collection schema must include at least the following three fields:

  • Primary field: Uniquely identifies each entity in the collection.

  • Text field (VARCHAR): Stores raw text documents. You must set enable_analyzer=True so that Zilliz Cloud can process the text for BM25 relevance ranking. By default, Zilliz Cloud uses the standard analyzer for text analysis. To configure a different analyzer, refer to Analyzer Overview.

  • Sparse vector field (SPARSE_FLOAT_VECTOR): Stores sparse embeddings automatically generated by the BM25 function.

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(
    uri="YOUR_CLUSTER_ENDPOINT",
    token="YOUR_CLUSTER_TOKEN"
)

schema = client.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True) # Primary field
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True) # Text field
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR) # Sparse vector field; no dim required for sparse vectors

Define the BM25 function

The BM25 function converts tokenized text into sparse vectors that support BM25 scoring.

Define the function and add it to your schema:

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25, # Set to `BM25`
)

schema.add_function(bm25_function)

Configure the index

After defining the schema with necessary fields and the built-in function, set up the index for your collection. To simplify this process, use AUTOINDEX as the index_type, an option that allows Zilliz Cloud to choose and configure the most suitable index type based on the structure of your data.

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX",
    metric_type="BM25"
)

Create the collection

Now create the collection using the schema and index parameters defined:

client.create_collection(
    collection_name='my_collection',
    schema=schema,
    index_params=index_params
)

Via web console

Alternatively, you can create a collection with a BM25 function in the Zilliz Cloud console.

Once the collection with a BM25 function is created, you can insert text and run lexical searches using text queries.

Step 2: Insert text data into the collection

After setting up your collection and index, you're ready to insert text data. You only need to provide the raw text; the BM25 function defined earlier automatically generates the sparse vector for each text entry.

client.insert('my_collection', [
    {'text': 'information retrieval is a field of study.'},
    {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
    {'text': 'data mining and information retrieval overlap in research.'},
])

Step 3: Search with a text query

Once you've inserted data into your collection, you can perform full text searches using raw text queries. Zilliz Cloud automatically converts your query into a sparse vector, ranks the matching documents using the BM25 algorithm, and returns the top-K results, where K is set by the limit parameter.

search_params = {
    'params': {'level': 10},
}

res = client.search(
    collection_name='my_collection',
    data=["what's the focus of information retrieval?"], # Raw query text; one result list is returned per query string
    anns_field='sparse',
    output_fields=['text'], # Fields to return in search results; sparse field cannot be output
    limit=3,
    search_params=search_params
)

print(res)
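
To work with individual hits rather than printing the whole response, a small helper such as the one below can flatten the results. It assumes that the response is a list containing one list of hits per query string, and that each hit can be read like a dictionary with `distance` (the BM25 score) and `entity` (the requested output fields) keys. Verify this against the SDK version you use. `summarize_hits` is a hypothetical helper, and the mock response and its scores are made up so the sketch runs without a cluster.

```python
def summarize_hits(results):
    """Flatten a search response into (rank, score, text) tuples."""
    rows = []
    for hits in results:  # one list of hits per query string
        for rank, hit in enumerate(hits, start=1):
            rows.append((rank, hit["distance"], hit["entity"].get("text")))
    return rows


# Mock response shaped like the assumed search output; scores are invented.
mock_res = [[
    {"id": 2, "distance": 1.9, "entity": {"text": "information retrieval focuses on finding relevant information in large datasets."}},
    {"id": 1, "distance": 0.8, "entity": {"text": "information retrieval is a field of study."}},
]]

for rank, score, text in summarize_hits(mock_res):
    print(f"{rank}. score={score:.2f} text={text}")
```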