Version: User Guides (BYOC)

Array Field

The Array type is used to store fields containing multiple values of the same data type. It provides a flexible way to store attributes with multiple elements, making it especially useful in scenarios where a set of related data needs to be saved. In Zilliz Cloud clusters, you can store Array fields alongside vector data, enabling more complex query and filtering requirements.

For example, in a music recommendation system, an Array field can store a list of tags for a song; in user behavior analysis, it can store user ratings for songs. Below is an example of a typical Array field:

{
"tags": ["pop", "rock", "classic"],
"ratings": [5, 4, 3]
}

In this example, tags and ratings are both Array fields. The tags field is a string array representing song genres like pop, rock, and classic, while the ratings field is an integer array representing user ratings for the song, ranging from 1 to 5. These Array fields provide a flexible way to store multi-value data, making it easier to perform detailed analysis during queries and filtering.

Add Array field

To use Array fields in Zilliz Cloud clusters, define the relevant field type when creating the collection schema. This process includes:

  1. Setting datatype to the supported Array data type, ARRAY.

  2. Using the element_type parameter to specify the data type of elements in the array. This can be any scalar data type supported by Zilliz Cloud clusters, such as VARCHAR or INT64. All elements in the same Array must be of the same data type.

  3. Using the max_capacity parameter to define the maximum capacity of the array, i.e., the maximum number of elements it can contain.

Here’s how to define a collection schema that includes Array fields:

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

schema = client.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

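# Add an Array field with elements of type VARCHAR
# (max_length caps the length of each string element and is needed for VARCHAR elements)
schema.add_field(field_name="tags", datatype=DataType.ARRAY, element_type=DataType.VARCHAR, max_capacity=10, max_length=65535)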
# Add an Array field with elements of type INT64
schema.add_field(field_name="ratings", datatype=DataType.ARRAY, element_type=DataType.INT64, max_capacity=5)

# Add primary field
schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True)

# Add vector field
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=3)

In this example:

  • tags is a string array with element_type set to VARCHAR, indicating that elements in the array must be strings. max_capacity is set to 10, meaning the array can contain up to 10 elements.

  • ratings is an integer array with element_type set to INT64, indicating that elements must be integers. max_capacity is set to 5, allowing up to 5 ratings.

  • We also add a primary key field pk and a vector field embedding.

📘Notes

The primary field and vector field are mandatory when you create a collection. The primary field uniquely identifies each entity, while the vector field is crucial for similarity search. For more details, refer to Primary Field & AutoId, Dense Vector, Binary Vector, or Sparse Vector.

Set index params

Setting index parameters for Array fields is optional but can significantly improve retrieval efficiency.

In the following example, we create an AUTOINDEX for the tags field, which means Zilliz Cloud clusters will automatically create an appropriate scalar index based on the data type. For more information, refer to AUTOINDEX Explained.

# Prepare index parameters
index_params = client.prepare_index_params() # Prepare IndexParams object

index_params.add_index(
    field_name="tags",  # Name of the Array field to index
    index_type="AUTOINDEX",  # Index type
    index_name="inverted_index"  # Index name
)

Moreover, you must create an index for the vector field before creating the collection. In this example, we use AUTOINDEX to simplify vector index setup.

# Add vector index
index_params.add_index(
    field_name="embedding",
    index_type="AUTOINDEX",  # Use automatic indexing to simplify complex index settings
    metric_type="COSINE"  # Specify similarity metric type, such as L2, COSINE, or IP
)

Create collection

Use the defined schema and index parameters to create a collection:

client.create_collection(
    collection_name="my_array_collection",
    schema=schema,
    index_params=index_params
)

Insert data

After creating the collection, you can insert data that includes Array fields.

data = [
    {
        "tags": ["pop", "rock", "classic"],
        "ratings": [5, 4, 3],
        "pk": 1,
        "embedding": [0.12, 0.34, 0.56]
    },
    {
        "tags": ["jazz", "blues"],
        "ratings": [4, 5],
        "pk": 2,
        "embedding": [0.78, 0.91, 0.23]
    },
    {
        "tags": ["electronic", "dance"],
        "ratings": [3, 3, 4],
        "pk": 3,
        "embedding": [0.67, 0.45, 0.89]
    }
]

client.insert(
    collection_name="my_array_collection",
    data=data
)

In this example:

  • Each entity includes a primary field (pk); tags and ratings are Array fields that hold the song's genre tags and user ratings.

  • embedding is a 3-dimensional vector field used for vector similarity searches.

Search and query

Array fields support scalar filtering during searches, so you can combine conditions on Array fields with vector similarity searches in Zilliz Cloud clusters.

Filter queries

You can filter data based on properties of Array fields, such as accessing a specific element or checking if an array element meets a certain condition.

filter = 'ratings[0] < 4'

res = client.query(
    collection_name="my_array_collection",
    filter=filter,
    output_fields=["tags", "ratings", "embedding"]
)

print(res)

# Output
# data: ["{'pk': 3, 'tags': ['electronic', 'dance'], 'ratings': [3, 3, 4], 'embedding': [np.float32(0.67), np.float32(0.45), np.float32(0.89)]}"]

In this query, Zilliz Cloud returns only the entities whose first element in the ratings array is less than 4. Of the three inserted entities, only the one with pk 3 matches.
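To make the semantics of this filter concrete, the following sketch applies the same condition locally in plain Python to the three sample entities. It only illustrates which entities satisfy ratings[0] < 4. The helper name first_rating_below is hypothetical, and this is not how Zilliz Cloud evaluates filters internally.

```python
# Local illustration only: mimics the filter 'ratings[0] < 4' on the sample data.
sample = [
    {"pk": 1, "tags": ["pop", "rock", "classic"], "ratings": [5, 4, 3]},
    {"pk": 2, "tags": ["jazz", "blues"], "ratings": [4, 5]},
    {"pk": 3, "tags": ["electronic", "dance"], "ratings": [3, 3, 4]},
]


def first_rating_below(entities, threshold):
    """Hypothetical helper: keep entities whose first rating is strictly below threshold."""
    return [e for e in entities if e["ratings"] and e["ratings"][0] < threshold]


matches = first_rating_below(sample, 4)
print([e["pk"] for e in matches])  # [3]
```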

Vector search with Array filtering

By combining vector similarity with Array filtering, you can ensure that the retrieved data is not only similar in semantics but also meets specific conditions, making the search results more accurate and aligned with business needs.

filter = 'tags[0] == "pop"'

res = client.search(
    collection_name="my_array_collection",
    data=[[0.3, -0.6, 0.1]],
    limit=5,
    search_params={"params": {"nprobe": 10}},
    output_fields=["tags", "ratings", "embedding"],
    filter=filter
)

print(res)

# Output
# data: ["[{'id': 1, 'distance': 1.1276001930236816, 'entity': {'ratings': [5, 4, 3], 'embedding': [0.11999999731779099, 0.3400000035762787, 0.5600000023841858], 'tags': ['pop', 'rock', 'classic']}}]"]

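In this example, Zilliz Cloud returns up to 5 entities most similar to the query vector, considering only entities whose tags array has "pop" as its first element. Because only one inserted entity satisfies the filter, the result contains a single hit.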
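Conceptually, a filtered search restricts the candidates to entities that satisfy the filter and then ranks those candidates by similarity. The sketch below imitates that behavior in plain Python with a brute-force cosine similarity over the sample data. The function names cosine_similarity and filtered_top_k are hypothetical, and the real service uses an index rather than a full scan, so treat this as an illustration of the result set only.

```python
import math

# Sample entities matching the data inserted earlier.
sample = [
    {"pk": 1, "tags": ["pop", "rock", "classic"], "embedding": [0.12, 0.34, 0.56]},
    {"pk": 2, "tags": ["jazz", "blues"], "embedding": [0.78, 0.91, 0.23]},
    {"pk": 3, "tags": ["electronic", "dance"], "embedding": [0.67, 0.45, 0.89]},
]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filtered_top_k(entities, query, predicate, limit):
    """Hypothetical helper: keep entities passing predicate, rank by similarity, cut to limit."""
    candidates = [e for e in entities if predicate(e)]
    ranked = sorted(
        candidates,
        key=lambda e: cosine_similarity(query, e["embedding"]),
        reverse=True,  # higher cosine similarity means more similar
    )
    return ranked[:limit]


# Local equivalent of the filter 'tags[0] == "pop"'.
def first_tag_is_pop(entity):
    return len(entity["tags"]) > 0 and entity["tags"][0] == "pop"


hits = filtered_top_k(sample, [0.3, -0.6, 0.1], first_tag_is_pop, limit=5)
print([e["pk"] for e in hits])  # [1]
```

Note that limit is an upper bound: with only one entity passing the filter, only one hit comes back even though limit is 5.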

Additionally, Zilliz Cloud supports advanced Array filtering operators like ARRAY_CONTAINS, ARRAY_CONTAINS_ALL, ARRAY_CONTAINS_ANY, and ARRAY_LENGTH to further enhance query capabilities. For more details, refer to ARRAY Operators.
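As a rough sketch of how these operators read in practice, the snippet below writes one filter expression per operator and pairs each with a plain-Python check that approximates its meaning on the sample data. The expression strings follow the operator names listed above and would be passed as the filter argument of client.query or client.search. The Python lambdas are an assumption-based local approximation for illustration, not the server's implementation.

```python
# Sample entities matching the data inserted earlier.
sample = [
    {"pk": 1, "tags": ["pop", "rock", "classic"], "ratings": [5, 4, 3]},
    {"pk": 2, "tags": ["jazz", "blues"], "ratings": [4, 5]},
    {"pk": 3, "tags": ["electronic", "dance"], "ratings": [3, 3, 4]},
]

# Filter expressions, one per operator.
contains_pop = 'ARRAY_CONTAINS(tags, "pop")'                  # array holds the value
contains_all = 'ARRAY_CONTAINS_ALL(tags, ["pop", "rock"])'    # array holds every listed value
contains_any = 'ARRAY_CONTAINS_ANY(tags, ["jazz", "dance"])'  # array holds at least one listed value
three_ratings = 'ARRAY_LENGTH(ratings) == 3'                  # array has exactly three elements

# Plain-Python approximations of each expression (local illustration only).
local_equivalents = {
    contains_pop: lambda e: "pop" in e["tags"],
    contains_all: lambda e: {"pop", "rock"} <= set(e["tags"]),
    contains_any: lambda e: bool({"jazz", "dance"} & set(e["tags"])),
    three_ratings: lambda e: len(e["ratings"]) == 3,
}

matched = {
    expr: [e["pk"] for e in sample if check(e)]
    for expr, check in local_equivalents.items()
}
for expr, pks in matched.items():
    print(expr, "->", pks)
```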

Limits

  • Data Type: All elements in an Array field must have the same data type, as specified by the element_type.

  • Array Capacity: The number of elements in an Array field must be less than or equal to the maximum capacity defined when the Array was created, as specified by max_capacity.

  • String Handling: String values in Array fields are stored as-is, without semantic escaping or conversion. For example, 'a"b', "a'b", 'a\'b', and "a\"b" are stored as entered, while 'a'b' and "a"b" are considered invalid values.
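The first two limits can be checked on the client before inserting, which surfaces bad rows earlier than a failed insert would. The following is a minimal sketch under stated assumptions: validate_array is a hypothetical helper, and the mapping of VARCHAR to Python str and INT64 to Python int is an assumption made for this example.

```python
def validate_array(values, element_type, max_capacity):
    """Hypothetical pre-insert check; returns a list of problems (empty if none)."""
    problems = []

    # Array Capacity: the element count must not exceed max_capacity.
    if len(values) > max_capacity:
        problems.append(
            f"{len(values)} elements exceeds max_capacity={max_capacity}"
        )

    # Data Type: every element must match element_type.
    # bool is a subclass of int in Python, so reject it explicitly for int arrays.
    bad = [
        v for v in values
        if not isinstance(v, element_type)
        or (element_type is int and isinstance(v, bool))
    ]
    if bad:
        problems.append(f"elements not of type {element_type.__name__}: {bad!r}")

    return problems


print(validate_array(["pop", "rock", "classic"], str, 10))  # []
print(validate_array([5, 4, 3, 2, 1, 5], int, 5))           # one capacity problem
print(validate_array([5, "4", 3], int, 5))                  # one type problem
```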