
Export Data Using Iterators

This guide provides an example of how to export data from a Zilliz Cloud collection.

Overview

The Milvus Python and Java SDKs both provide iterator APIs that let you traverse the entities in a collection in a memory-efficient manner. For details, refer to Search Iterator.

Using iterators offers the following benefits:

  • Simplicity: Eliminates the need to manage offset and limit settings.

  • Efficiency: Provides scalable data retrieval by fetching only the data needed.

  • Consistency: Ensures a consistent dataset size with boolean filters.

You can use these APIs to export some or all of the entities from a Zilliz Cloud collection.

📘Notes

This feature is available for Zilliz Cloud clusters that are compatible with Milvus 2.3.x or later.

Preparations

The following steps connect to a Zilliz Cloud cluster, quickly set up a collection, and insert 10,000 randomly generated entities into it.

Step 1: Create a collection

from pymilvus import MilvusClient

CLUSTER_ENDPOINT = "YOUR_CLUSTER_ENDPOINT"
TOKEN = "YOUR_CLUSTER_TOKEN"

# 1. Set up a Milvus client
client = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    token=TOKEN
)

# 2. Create a collection
client.create_collection(
    collection_name="quick_setup",
    dimension=5,
)

Step 2: Insert randomly generated entities

import random

# 3. Insert randomly generated vectors
colors = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"]
data = []

for i in range(10000):
    # Pick a random color and a random four-digit tag for each entity
    current_color = random.choice(colors)
    current_tag = random.randint(1000, 9999)
    data.append({
        "id": i,
        "vector": [random.uniform(-1, 1) for _ in range(5)],
        "color": current_color,
        "tag": current_tag,
        "color_tag": f"{current_color}_{current_tag}"
    })

print(data[0])

# Output
#
# {
# "id": 0,
# "vector": [
# -0.5705990742218152,
# 0.39844925120642083,
# -0.8791287928610869,
# 0.024163154953680932,
# 0.6837669917169638
# ],
# "color": "purple",
# "tag": 7774,
# "color_tag": "purple_7774"
# }

res = client.insert(
    collection_name="quick_setup",
    data=data,
)

print(res)

# Output
#
# {
# "insert_count": 10000,
# "ids": [
# 0,
# 1,
# 2,
# 3,
# 4,
# 5,
# 6,
# 7,
# 8,
# 9,
# "(9990 more items hidden)"
# ]
# }

Export data using iterators

To export data using iterators, do as follows:

  1. Initialize the iterator to define the query or search parameters and the output fields. You can limit the number of entities returned per iteration by setting the batch_size parameter.

  2. Use the next() method within a loop to page through the results.

    • If the method returns an empty array, the loop terminates.

    • Otherwise, save the returned entities in whatever way suits you. For example, you can append them to a file, save them into a database, or feed them to other consumer programs.

  3. Call the close() method to close the iterator once all data has been retrieved.
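
The steps above can be sketched as a small generator that works with any object exposing next() and close(). This is a minimal illustration, not part of the SDK: iter_batches is a hypothetical helper name, and FakeIterator is a hypothetical stand-in for a real iterator so that the snippet runs without a cluster.

```python
from typing import Any, Dict, Iterator, List


class FakeIterator:
    """Hypothetical stand-in for an SDK iterator (local illustration only)."""

    def __init__(self, rows: List[Dict[str, Any]], batch_size: int) -> None:
        self._rows = rows
        self._batch_size = batch_size
        self._pos = 0
        self.closed = False

    def next(self) -> List[Dict[str, Any]]:
        # Return the next slice of rows; an empty list signals exhaustion
        batch = self._rows[self._pos:self._pos + self._batch_size]
        self._pos += len(batch)
        return batch

    def close(self) -> None:
        self.closed = True


def iter_batches(iterator) -> Iterator[List[Dict[str, Any]]]:
    """Yield batches from next() until an empty batch, then close the iterator."""
    try:
        while True:
            batch = iterator.next()
            if not batch:
                break
            yield batch
    finally:
        # Runs when the loop ends or if the consumer raises an error
        iterator.close()


rows = [{"id": i, "color_tag": f"brown_{8000 + i}"} for i in range(25)]
fake = FakeIterator(rows, batch_size=10)
batch_sizes = [len(batch) for batch in iter_batches(fake)]
print(batch_sizes)
```

Placing close() in a finally clause is a design choice of this sketch: the iterator is released even if saving a batch fails partway through the export.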

The following code snippet demonstrates how to append the exported data to a file using the QueryIterator API.

import json
from pymilvus import connections, Collection

CLUSTER_ENDPOINT = "YOUR_CLUSTER_ENDPOINT"
TOKEN = "YOUR_CLUSTER_TOKEN"

connections.connect(
    uri=CLUSTER_ENDPOINT,
    token=TOKEN
)

collection = Collection("quick_setup")

# Query with iterator

# Initiate an empty JSON file
with open('results.json', 'w') as fp:
    json.dump([], fp)

iterator = collection.query_iterator(
    batch_size=10,
    expr='color_tag like "brown_8%"',
    output_fields=["color_tag"]
)

while True:
    result = iterator.next()
    if not result:
        # An empty batch means all matching entities have been returned
        iterator.close()
        break

    # Read existing records and append the returns
    with open('results.json', 'r') as fp:
        results = json.load(fp)
    results += result

    # Save the result set
    with open('results.json', 'w') as fp:
        json.dump(results, fp)
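
The snippet above rereads and rewrites results.json on every batch, which becomes slower as the export grows. One possible alternative, sketched below, is to append each entity as a single line of JSON (the JSON Lines layout) so that each batch is written only once. The names export_to_jsonl and StubIterator are hypothetical: the stub stands in for the query iterator so the sketch runs without a cluster, and the sketch assumes every returned entity is a dict of JSON-serializable values.

```python
import json
import os
import tempfile


class StubIterator:
    """Hypothetical stand-in for the query iterator (local illustration only)."""

    def __init__(self, rows, batch_size):
        self._rows = rows
        self._batch_size = batch_size
        self._pos = 0

    def next(self):
        # Return the next slice of rows; an empty list signals exhaustion
        batch = self._rows[self._pos:self._pos + self._batch_size]
        self._pos += len(batch)
        return batch

    def close(self):
        pass


def export_to_jsonl(iterator, path):
    """Append every entity to `path`, one JSON object per line; return the count."""
    total = 0
    try:
        with open(path, "a", encoding="utf-8") as fp:
            while True:
                batch = iterator.next()
                if not batch:
                    break
                for row in batch:
                    fp.write(json.dumps(row) + "\n")
                total += len(batch)
    finally:
        iterator.close()
    return total


rows = [{"id": i, "color_tag": f"brown_8{i:03d}"} for i in range(23)]
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
written = export_to_jsonl(StubIterator(rows, batch_size=10), out_path)
print(written)
```

With a real cluster, the object returned by collection.query_iterator(...) could be passed in place of StubIterator, since it exposes the same next() and close() methods used here.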