Skip to main content
Version: User Guides (Cloud)

Grouping Search

A grouping search allows Zilliz Cloud to group the search results by the values in a specified field to aggregate data at a higher level. For example, you can use a basic ANN search to find books similar to the one at hand, but you can use a grouping search to find the book categories that may involve the topics discussed in that book. This topic describes how to use Grouping Search along with key considerations.

Overview

When entities in the search results share the same value in a scalar field, this indicates that they are similar in a particular attribute, which may negatively impact the search results.

Assume a collection stores multiple documents (denoted by docId). To retain as much semantic information as possible when converting documents into vectors, each document is split into smaller, manageable paragraphs (or chunks) and stored as separate entities. Even though the document is divided into smaller sections, users are often still interested in identifying which documents are most relevant to their needs.

LhJEwzWiphLWxobMaiCcbVDPnNb

When performing an Approximate Nearest Neighbor (ANN) search on such a collection, the search results may include several paragraphs from the same document, potentially causing other documents to be overlooked, which may not align with the intended use case.

Ktj8wigrHhvz4nbDES5coKZJnZe

To improve the diversity of search results, you can add the group_by_field parameter in the search request to enable Grouping Search. As shown in the diagram, you can set group_by_field to docId. Upon receiving this request, Zilliz Cloud will:

  • Perform an ANN search based on the provided query vector to find all entities most similar to the query.

  • Group the search results by the specified group_by_field, such as docId.

  • Return the top results for each group, as defined by the limit parameter, with the most similar entity from each group.

This section provides example code to demonstrate the use of Grouping Search. The following example assumes the collection includes fields for id, vector, chunk, and docId.

[
{"id": 0, "vector": [0.3580376395471989, -0.6023495712049978, 0.18414012509913835, -0.26286205330961354, 0.9029438446296592], "chunk": "pink_8682", "docId": 1},
{"id": 1, "vector": [0.19886812562848388, 0.06023560599112088, 0.6976963061752597, 0.2614474506242501, 0.838729485096104], "chunk": "red_7025", "docId": 5},
{"id": 2, "vector": [0.43742130801983836, -0.5597502546264526, 0.6457887650909682, 0.7894058910881185, 0.20785793220625592], "chunk": "orange_6781", "docId": 2},
{"id": 3, "vector": [0.3172005263489739, 0.9719044792798428, -0.36981146090600725, -0.4860894583077995, 0.95791889146345], "chunk": "pink_9298", "docId": 3},
{"id": 4, "vector": [0.4452349528804562, -0.8757026943054742, 0.8220779437047674, 0.46406290649483184, 0.30337481143159106], "chunk": "red_4794", "docId": 3},
{"id": 5, "vector": [0.985825131989184, -0.8144651566660419, 0.6299267002202009, 0.1206906911183383, -0.1446277761879955], "chunk": "yellow_4222", "docId": 4},
{"id": 6, "vector": [0.8371977790571115, -0.015764369584852833, -0.31062937026679327, -0.562666951622192, -0.8984947637863987], "chunk": "red_9392", "docId": 1},
{"id": 7, "vector": [-0.33445148015177995, -0.2567135004164067, 0.8987539745369246, 0.9402995886420709, 0.5378064918413052], "chunk": "grey_8510", "docId": 2},
{"id": 8, "vector": [0.39524717779832685, 0.4000257286739164, -0.5890507376891594, -0.8650502298996872, -0.6140360785406336], "chunk": "white_9381", "docId": 5},
{"id": 9, "vector": [0.5718280481994695, 0.24070317428066512, -0.3737913482606834, -0.06726932177492717, -0.6980531615588608], "chunk": "purple_4976", "docId": 3},
]

In the search request, set both group_by_field and output_fields to docId. Zilliz Cloud will group the results by the specified field and return the most similar entity from each group, including the value of docId for each returned entity.

from pymilvus import MilvusClient

client = MilvusClient(
uri="YOUR_CLUSTER_ENDPOINT",
token="YOUR_CLUSTER_TOKEN"
)

query_vectors = [
[0.14529211512077012, 0.9147257273453546, 0.7965055218724449, 0.7009258593102812, 0.5605206522382088]]

# Group search results
res = client.search(
collection_name="group_search_collection",
data=query_vectors,
limit=3,
group_by_field="docId",
output_fields=["docId"]
)

# Retrieve the values in the `docId` column
doc_ids = [result['entity']['docId'] for result in res[0]]

In the request above, limit=3 indicates that the system will return search results from three groups, with each group containing the single most similar entity to the query vector.

Considerations

  • Number of groups: The limit parameter controls the number of groups from which search results are returned, rather than the specific number of entities within each group. Setting an appropriate limit helps control search diversity and query performance. Reducing limit can reduce computation costs if data is densely distributed or performance is a concern.