Quickstart to On-Demand Search with External Collections (Public Preview)
On-demand search lets you search massive datasets with zero-copy access to data in external storage or imported into Zilliz Cloud, without keeping compute resources running continuously. You can create collections from external volumes or imported files, build indexes and refresh metadata via the project data plane endpoint, and start an on-demand cluster only when you need to run search or query workloads.
The overall procedure is as follows:
Before you start
- Create a storage integration.
  A storage integration is a profile that records your data location along with access credentials. To set one up, follow the steps to create an AWS S3, Google GCS, or Azure storage integration, and obtain the storage integration ID.
- Create an external volume.
  An external volume is a path within a storage integration. Ensure that your raw data resides on that path. You can create multiple external volumes from the same storage integration. To create an external volume, refer to External Volumes.
Step 1: Connect to a project endpoint.
Before working on a database, connect to the project endpoint. You can obtain the project endpoint on the quickstart page after enabling on-demand compute on the Zilliz Cloud console.
External collection operations require an API key for authentication. This flow does not support username:password authentication.
- Python
- cURL
from pymilvus import MilvusClient

# connect to the project endpoint
client = MilvusClient(
    # a project-specific on-demand compute endpoint
    uri="https://{project-id}.{region}.api.zillizcloud.com",
    token="YOUR_API_KEY"
)
export PROJECT_ENDPOINT="https://{project-id}.{region}.api.zillizcloud.com"
export TOKEN="YOUR_API_KEY"
Step 2: (Optional) Create a database.
Zilliz Cloud ships with a default database. If you plan to use it, skip this step. Otherwise, create a database as follows.
- Python
- cURL
client.create_database(
db_name="my_database"
)
curl --request POST \
--url "${PROJECT_ENDPOINT}/v2/vectordb/databases/create" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
"dbName": "my_database"
}'
Step 3: Create an external collection.
Once the database is ready, you can create external collections in it. An external collection maps its columns to the data files you specify and attaches on-demand compute resources for the searches in that collection.
Unlike managed collections that require you to import your raw data into the collection, external collections generate metadata from your raw data via sub-second refresh operations.
The following example demonstrates how to map collection fields to the fields in your data files. When initializing the schema, pass in the volume path and file format of your data.
- Python
- cURL
from pymilvus import MilvusClient, DataType
schema = MilvusClient.create_schema(
    external_source='volume://my_volume/iceberg/metadata/00001-xxx.metadata.json',
    external_spec='''{
        "format": "iceberg-table",
        "snapshot_id": "1234567890123456789"
    }'''
)
schema.add_field(
field_name="vector",
datatype=DataType.FLOAT_VECTOR,
dim=1536,
# highlight-next
external_field="embedding" # field name in the external data file
)
schema.add_field(
field_name="product_id",
datatype=DataType.VARCHAR,
max_length=32,
nullable=True,
# highlight-next
external_field="product_id"
)
schema.add_field(
field_name="title",
datatype=DataType.VARCHAR,
max_length=512,
nullable=True,
# highlight-next
external_field="title"
)
schema.add_field(
field_name="main_category",
datatype=DataType.VARCHAR,
max_length=64,
nullable=True,
# highlight-next
external_field="main_category"
)
schema.add_field(
field_name="price",
datatype=DataType.DOUBLE,
nullable=True,
# highlight-next
external_field="price"
)
schema.add_field(
field_name="average_rating",
datatype=DataType.DOUBLE,
nullable=True,
# highlight-next
external_field="average_rating"
)
schema.add_field(
field_name="rating_number",
datatype=DataType.INT64,
nullable=True,
# highlight-next
external_field="rating_number"
)
export schema='{
"externalSource": "volume://my_volume/iceberg/metadata/00001-xxx.metadata.json",
"externalSpec": "{\"format\": \"iceberg-table\", \"snapshot_id\": \"1234567890123456789\"}",
"fields": [
{
"fieldName": "vector",
"dataType": "FloatVector",
"elementTypeParams": {
"dim": "1536"
},
"externalField": "embedding"
},
{
"fieldName": "product_id",
"dataType": "VarChar",
"elementTypeParams": {
"max_length": "32"
},
"nullable": true,
"externalField": "product_id"
},
{
"fieldName": "title",
"dataType": "VarChar",
"elementTypeParams": {
"max_length": "512"
},
"nullable": true,
"externalField": "title"
},
{
"fieldName": "main_category",
"dataType": "VarChar",
"elementTypeParams": {
"max_length": "64"
},
"nullable": true,
"externalField": "main_category"
},
{
"fieldName": "price",
"dataType": "Double",
"nullable": true,
"externalField": "price"
},
{
"fieldName": "average_rating",
"dataType": "Double",
"nullable": true,
"externalField": "average_rating"
},
{
"fieldName": "rating_number",
"dataType": "Int64",
"nullable": true,
"externalField": "rating_number"
}
]
}'
Then you can create a collection with the above schema. If you decide to use the default database, you can skip the use_database call (Python) or the dbName parameter (cURL).
- Python
- cURL
client.use_database(
db_name="my_database"
)
# create the collection
client.create_collection(
collection_name="my_collection",
schema=schema
)
curl --request POST \
--url "${PROJECT_ENDPOINT}/v2/vectordb/collections/create" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d "{
\"dbName\": \"my_database\",
\"collectionName\": \"my_collection\",
\"schema\": $schema
}"
Step 4: Create indexes and refresh the collection.
You can create indexes on an external collection as you do on managed collections. All vector fields must be indexed, and you can choose to index some scalar fields for fast metadata filtering. However, you need to call refresh to actually build the indexes.
- Python
- cURL
index_params = client.prepare_index_params()
# Add indexes
index_params.add_index(
field_name="vector",
index_type="AUTOINDEX",
metric_type="COSINE"
)
index_params.add_index(
field_name="main_category",
index_type="AUTOINDEX"
)
client.create_index(
db_name="my_database",
collection_name="my_collection",
index_params=index_params
)
export indexParams='[
{
"fieldName": "vector",
"metricType": "COSINE",
"indexName": "vector",
"indexType": "AUTOINDEX"
},
{
"fieldName": "main_category",
"indexName": "main_category",
"indexType": "AUTOINDEX"
}
]'
curl --request POST \
--url "${PROJECT_ENDPOINT}/v2/vectordb/indexes/create" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d "{
\"dbName\": \"my_database\",
\"collectionName\": \"my_collection\",
\"indexParams\": $indexParams
}"
Then refresh the external collection. You can omit externalSource and externalSpec to reuse the collection schema, or provide both to refresh the collection schema from a new source.
- Python
- cURL
# refresh the external collection
job_id = client.refresh_external_collection(
collection_name="my_collection"
)
# Refresh the external collection
curl --request POST \
--url "${PROJECT_ENDPOINT}/v2/vectordb/jobs/external_collection/refresh" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
"dbName": "default",
"collectionName": "my_collection"
}'
# job-xxxxxxxxxxxxxxxxxxx
Then wrap the progress-monitoring call in a loop to track the refresh operation until it completes.
- Python
- cURL
progress = client.get_refresh_external_collection_progress(job_id=job_id)
curl -s --request POST \
--url "${PROJECT_ENDPOINT}/v2/vectordb/jobs/external_collection/describe" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
"jobId": "job-xxxxxxxxxxxxxxxxxxx"
}'
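The loop can be sketched as follows. This is a minimal sketch: the shape of the progress response (a dict-like object with a state field) is an assumption for illustration, so adjust the field names to match what get_refresh_external_collection_progress actually returns in your client version.

```python
import time

def wait_for_refresh(get_progress, interval=5, timeout=600):
    """Poll a progress-returning callable until the refresh job finishes.

    get_progress: zero-argument callable returning a dict-like progress
    object; a 'state' field of 'Completed' or 'Failed' is assumed here.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        progress = get_progress()
        state = progress.get("state")
        if state == "Completed":
            return progress
        if state == "Failed":
            raise RuntimeError(f"refresh failed: {progress}")
        time.sleep(interval)
    raise TimeoutError("refresh did not finish in time")

# usage against the client (response fields are assumptions):
# result = wait_for_refresh(
#     lambda: client.get_refresh_external_collection_progress(job_id=job_id)
# )
```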
Step 5: Create an on-demand cluster.
Once your external collection is ready, you need to attach it to an on-demand cluster for on-demand searches. The following command creates a cluster and returns its ID.
export CONTROL_PLANE_ENDPOINT="https://api.cloud.zilliz.com"
curl --request POST \
--url "${CONTROL_PLANE_ENDPOINT}/v2/clusters/createOnDemandCluster" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
"projectId": "proj-xxxxxxxxxxxxxxxxxxx",
"regionId": "aws-us-west-2",
"clusterName": "my-on-demand",
"cuSize": 8,
"autoSuspend": 60
}'
# inxx-xxxxxxxxxxxxx
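If you script this step, you can capture the returned cluster ID for use in the next step. The response shape below (a data.clusterId field) is a hypothetical example, not the documented format; adjust the jq path to the actual response. Requires jq.

```shell
# hypothetical response shape -- verify against the actual API output
RESPONSE='{"code": 0, "data": {"clusterId": "inxx-xxxxxxxxxxxxx"}}'

# extract the cluster ID for the session in Step 6
CLUSTER_ID=$(echo "$RESPONSE" | jq -r '.data.clusterId')
echo "$CLUSTER_ID"
```

In practice, you would set RESPONSE from the create call itself, e.g. RESPONSE=$(curl -s --request POST ...).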
Step 6: Conduct searches.
When you need to conduct searches, queries, or hybrid searches, you can attach to the on-demand cluster created in the previous step through a session.
- Python
- cURL
session = client.session(
cluster_id="inxx-xxxxxxxxxxxxx"
)
# 1536-dimensional vector
query_vector = [0.3580376395471989, -0.6023495712049978, 0.18414012509913835, -0.26286205330961354, ..., 0.9029438446296592]
res = session.search(
db_name="my_database",
collection_name="my_collection",
anns_field="vector",
data=[query_vector],
limit=3,
output_fields=["product_id", "title", "main_category", "price", "average_rating", "rating_number"],
search_params={"metric_type": "COSINE"}
)
curl --request POST \
--url "${PROJECT_ENDPOINT}/v2/vectordb/entities/search?cluster_id=inxx-xxxxxxxxxxxxx" \
--header "Authorization: Bearer ${TOKEN}" \
--header "Content-Type: application/json" \
-d '{
"dbName": "my_database",
"collectionName": "my_collection",
"data": [
[
0.3580376395471989,
-0.6023495712049978,
0.18414012509913835,
-0.26286205330961354,
0.9029438446296592
]
],
"annsField": "vector",
"limit": 3,
"outputFields": [
"product_id",
"title",
"main_category",
"price",
"average_rating",
"rating_number"
]
}'
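Once results come back, you will typically rank and inspect the hits. Assuming each hit carries a distance (for COSINE, a higher score means a closer match) and the requested output fields under an entity key (the exact shape may vary by client version), a small helper might look like:

```python
def summarize_hits(hits, top=3):
    """Sort hits by cosine score (higher first) and return
    (title, price, score) tuples for quick inspection."""
    ranked = sorted(hits, key=lambda h: h["distance"], reverse=True)
    return [(h["entity"]["title"], h["entity"]["price"], h["distance"])
            for h in ranked[:top]]

# synthetic hits shaped like an assumed search response
sample = [
    {"distance": 0.87, "entity": {"title": "USB-C hub", "price": 34.5}},
    {"distance": 0.91, "entity": {"title": "Wireless mouse", "price": 19.99}},
]
print(summarize_hits(sample))
```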
From there, you can explore your data and identify its most valuable subset. You can then connect to a serving cluster, import that subset, and serve it in production.