
Multi-language Analyzers

When Zilliz Cloud performs text analysis, it typically applies a single analyzer across an entire text field in a collection. If that analyzer is optimized for English, it struggles with the very different tokenization and stemming rules required by other languages, such as Chinese, Spanish, or French, resulting in a lower recall rate. For instance, a search for the Spanish word "teléfono" (meaning "phone") would trip up an English-focused analyzer: it may drop the accent and apply no Spanish-specific stemming, causing relevant results to be overlooked.

Multi‑language analyzers resolve this issue by allowing you to configure multiple analyzers for a text field in a single collection. This way, you can store multilingual documents in a text field, and Zilliz Cloud analyzes text according to the appropriate language rules for each document.

Limits

  • This feature works only with BM25-based text retrieval and sparse vectors. For more information, refer to Full Text Search.

  • Each document in a single collection can use only one analyzer, determined by its language identifier field value.

  • Performance may vary depending on the complexity of your analyzers and the size of your text data.

Overview

The following workflow shows how to configure and use multi-language analyzers in Zilliz Cloud:


  1. Configure Multi-language Analyzers:

    • Set up multiple language-specific analyzers using the format: <analyzer_name>: <analyzer_config>, where each analyzer_config follows standard analyzer_params configuration as described in Analyzer Overview.

    • Define a special identifier field that will determine analyzer selection for each document.

    • Configure a default analyzer for handling unknown languages.

  2. Create Collection:

    • Define schema with essential fields:

      • primary_key: Unique document identifier.

      • text_field: Stores original text content.

      • identifier_field: Indicates which analyzer to use for each document.

      • vector_field: Stores sparse embeddings to be generated by the BM25 function.

    • Configure BM25 function and indexing parameters.

  3. Insert Data with Language Identifiers:

    • Add documents containing text in various languages, where each document includes an identifier value specifying which analyzer to use.

    • Zilliz Cloud selects the appropriate analyzer based on the identifier field, and documents with unknown identifiers use the default analyzer.

  4. Search with Language-Specific Analyzers:

    • Provide query text with an analyzer name specified, and Zilliz Cloud processes the query using the specified analyzer.

    • Tokenization occurs according to language-specific rules, and search returns language-appropriate results based on similarity.

Step 1: Configure multi_analyzer_params

multi_analyzer_params is a single JSON object that determines how Zilliz Cloud selects the appropriate analyzer for each entity:

multi_analyzer_params = {
    # Define language-specific analyzers
    # Each analyzer follows this format: <analyzer_name>: <analyzer_params>
    "analyzers": {
        "english": {"type": "english"},  # English-optimized analyzer
        "chinese": {"type": "chinese"},  # Chinese-optimized analyzer
        "default": {"tokenizer": "icu"}  # Required fallback analyzer
    },
    "by_field": "language",  # Field determining analyzer selection
    "alias": {
        "cn": "chinese",  # Use "cn" as shorthand for Chinese
        "en": "english"   # Use "en" as shorthand for English
    }
}

analyzers (required)

Lists every language-specific analyzer that Zilliz Cloud can use to process text. Each entry follows the format <analyzer_name>: <analyzer_params>.

  • Define each analyzer with the standard analyzer_params syntax (see Analyzer Overview).

  • Include an entry whose key is default; Zilliz Cloud falls back to this analyzer whenever the value stored in by_field does not match any other analyzer name.

by_field (required)

Name of the field that stores, for each document, the language (that is, the analyzer name) Zilliz Cloud should apply.

  • Must be a VARCHAR field defined in the collection.

  • Its value in every row must exactly match one of the analyzer names (or aliases) listed in analyzers.

  • If a row's value is missing or unmatched, Zilliz Cloud automatically applies the default analyzer.

alias (optional)

Creates shortcuts or alternative names for your analyzers, making them easier to reference in your code. Each analyzer can have one or more aliases, and each alias must map to an existing analyzer key.
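To make these resolution rules concrete, the hypothetical rows below trace how by_field values would map to analyzers under the configuration above; the Russian row is an illustrative assumption added here to show the default fallback.

# Hypothetical rows illustrating how by_field values resolve to analyzers
# under the multi_analyzer_params configuration above:
rows = [
    {"text": "Large language models", "language": "english"},  # exact match -> "english"
    {"text": "Neural networks", "language": "en"},             # alias -> "english"
    {"text": "深度学习", "language": "cn"},                     # alias -> "chinese"
    {"text": "Нейронные сети", "language": "russian"},         # no match -> "default" (icu)
]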

Step 2: Create collection

Creating a collection with multi-language support requires configuring specific fields and indexes:

Step 2.1: Add fields

In this step, define the collection schema with four essential fields:

  • Primary Key Field (id): A unique identifier for each entity in the collection. Setting auto_id=True enables Zilliz Cloud to automatically generate these IDs.

  • Language Indicator Field (language): This VARCHAR field corresponds to the by_field specified in your multi_analyzer_params. It stores the language identifier for each entity, which tells Zilliz Cloud which analyzer to use.

  • Text Content Field (text): This VARCHAR field stores the actual text data you want to analyze and search. Setting enable_analyzer=True is crucial as it activates text analysis capabilities for this field. The multi_analyzer_params configuration is attached directly to this field, establishing the connection between your text data and language-specific analyzers.

  • Vector Field (sparse): This field will store the sparse vectors generated by the BM25 function. These vectors represent the analyzable form of your text data and are what Zilliz Cloud actually searches.

# Import required modules
from pymilvus import MilvusClient, DataType, Function, FunctionType

# Initialize client
client = MilvusClient(
    uri="YOUR_CLUSTER_ENDPOINT",
)

# Initialize a new schema
schema = client.create_schema()

# Add a primary key field for unique document identification
schema.add_field(
    field_name="id",          # Field name
    datatype=DataType.INT64,  # Integer data type
    is_primary=True,          # Designate as primary key
    auto_id=True              # Auto-generate IDs (recommended)
)

# Add the language identifier field
# This MUST match the "by_field" value in multi_analyzer_params
schema.add_field(
    field_name="language",      # Field name
    datatype=DataType.VARCHAR,  # String data type
    max_length=255              # Maximum length (adjust as needed)
)

# Add the text content field with multi-language analysis capability
schema.add_field(
    field_name="text",          # Field name
    datatype=DataType.VARCHAR,  # String data type
    max_length=8192,            # Maximum length (adjust based on expected text size)
    enable_analyzer=True,       # Enable text analysis
    multi_analyzer_params=multi_analyzer_params  # Connect with our language analyzers
)

# Add the sparse vector field to store the BM25 output
schema.add_field(
    field_name="sparse",                   # Field name
    datatype=DataType.SPARSE_FLOAT_VECTOR  # Sparse vector data type
)

Step 2.2: Define BM25 function

Define a BM25 function to generate sparse vector representations from your raw text data:

# Create the BM25 function
bm25_function = Function(
    name="text_to_vector",            # Descriptive function name
    function_type=FunctionType.BM25,  # Use BM25 algorithm
    input_field_names=["text"],       # Process text from this field
    output_field_names=["sparse"]     # Store vectors in this field
)

# Add the function to our schema
schema.add_function(bm25_function)

This function automatically applies the appropriate analyzer to each text entry based on its language identifier. For more information on BM25-based text retrieval, refer to Full Text Search.

Step 2.3: Configure index params

To allow efficient searching, create an index on the sparse vector field:

# Configure index parameters
index_params = client.prepare_index_params()

# Add index for sparse vector field
index_params.add_index(
    field_name="sparse",     # Field to index (our vector field)
    index_type="AUTOINDEX",  # Let Zilliz Cloud choose the optimal index type
    metric_type="BM25"       # Must be BM25 for this feature
)

The index improves search performance by organizing sparse vectors for efficient BM25 similarity calculations.
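If you want to adjust BM25 scoring rather than accept the defaults, the relevant knobs are bm25_k1 (term-frequency saturation) and bm25_b (document-length normalization). The following is a minimal sketch, assuming your cluster accepts these parameters alongside the BM25 metric; if it does not, keep the plain AUTOINDEX configuration above.

# Optional sketch: tune BM25 scoring via index params.
# Assumes bm25_k1/bm25_b are accepted on your cluster; otherwise keep the defaults.
tuned_index_params = client.prepare_index_params()
tuned_index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX",
    metric_type="BM25",
    params={
        "bm25_k1": 1.2,  # Term-frequency saturation (typical range 1.2-2.0)
        "bm25_b": 0.75   # Document-length normalization (0 = none, 1 = full)
    }
)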

Step 2.4: Create the collection

This final creation step brings together all your previous configurations:

  • collection_name="multilang_demo" names your collection for future reference.

  • schema=schema applies the field structure and function you defined.

  • index_params=index_params implements the indexing strategy for efficient searches.

# Create collection
COLLECTION_NAME = "multilingual_documents"

# Check if collection already exists
if client.has_collection(COLLECTION_NAME):
    client.drop_collection(COLLECTION_NAME)  # Remove it for this example
    print(f"Dropped existing collection: {COLLECTION_NAME}")

# Create the collection
client.create_collection(
    collection_name=COLLECTION_NAME,  # Collection name
    schema=schema,                    # Our multilingual schema
    index_params=index_params         # Our search index configuration
)

At this point, Zilliz Cloud creates an empty collection with multi-language analyzer support, ready to receive data.
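As a quick sanity check, you can describe the collection to confirm the schema took effect. A minimal sketch, assuming the dictionary keys returned by pymilvus shown in the comments:

# Optional sanity check: confirm the collection and its fields exist.
info = client.describe_collection(COLLECTION_NAME)
print(info["collection_name"])                      # multilingual_documents
print([field["name"] for field in info["fields"]])  # ['id', 'language', 'text', 'sparse']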

Step 3: Insert example data

When adding documents to your multi-language collection, each must include both text content and a language identifier:

# Prepare multilingual documents
documents = [
    # English documents
    {
        "text": "Artificial intelligence is transforming technology",
        "language": "english",  # Using full language name
    },
    {
        "text": "Machine learning models require large datasets",
        "language": "en",  # Using our defined alias
    },
    # Chinese documents
    {
        "text": "人工智能正在改变技术领域",
        "language": "chinese",  # Using full language name
    },
    {
        "text": "机器学习模型需要大型数据集",
        "language": "cn",  # Using our defined alias
    },
]

# Insert the documents
result = client.insert(COLLECTION_NAME, documents)

# Print results
inserted = result["insert_count"]
print(f"Successfully inserted {inserted} documents")
print("Documents by language: 2 English, 2 Chinese")

# Expected output:
# Successfully inserted 4 documents
# Documents by language: 2 English, 2 Chinese

During insertion, Zilliz Cloud:

  1. Reads each document's language field

  2. Applies the corresponding analyzer to the text field

  3. Generates a sparse vector representation via the BM25 function

  4. Stores both the original text and the generated sparse vector

📘 Notes

You don't need to provide the sparse vector directly; the BM25 function generates it automatically based on your text and the specified analyzer.

Step 4: Perform search operations

Use English analyzer

When searching with multi-language analyzers, search_params contains crucial configuration:

  • metric_type="BM25" must match your index configuration.

  • analyzer_name="english" specifies which analyzer to apply to your query text. This is independent of the analyzers used on stored documents.

  • params={"drop_ratio_search": "0"} controls BM25-specific behavior; here, it retains all terms in the search. For more information, refer to Sparse Vector.

search_params = {
    "metric_type": "BM25",       # Must match index configuration
    "analyzer_name": "english",  # Analyzer that matches the query language
    "drop_ratio_search": "0",    # Keep all terms in search (tweak as needed)
}

# Execute the search
english_results = client.search(
    collection_name=COLLECTION_NAME,     # Collection to search
    data=["artificial intelligence"],    # Query text
    anns_field="sparse",                 # Field to search against
    search_params=search_params,         # Search configuration
    limit=3,                             # Max results to return
    output_fields=["text", "language"],  # Fields to include in the output
    consistency_level="Bounded",         # Data-consistency guarantee
)

# Display English search results
print("\n=== English Search Results ===")
for i, hit in enumerate(english_results[0]):
    print(f"{i+1}. [{hit['distance']:.4f}] {hit['entity'].get('text')} "
          f"(Language: {hit['entity'].get('language')})")

# Expected output (English Search Results):
# 1. [2.7881] Artificial intelligence is transforming technology (Language: english)

Use Chinese analyzer

This example demonstrates switching to the Chinese analyzer (using its alias "cn") for different query text. All other parameters remain the same, but now the query text is processed using Chinese-specific tokenization rules.

search_params["analyzer_name"] = "cn"

chinese_results = client.search(
collection_name=COLLECTION_NAME, # Collection to search
data=["人工智能"], # Query text
anns_field="sparse", # Field to search against
search_params=search_params, # Search configuration
limit=3, # Max results to return
output_fields=["text", "language"], # Fields to include in the output
consistency_level="Bounded", # Data‑consistency guarantee
)

# Display Chinese search results
print("\n=== Chinese Search Results ===")
for i, hit in enumerate(chinese_results[0]):
print(f"{i+1}. [{hit.score:.4f}] {hit.entity.get('text')} "
f"(Language: {hit.entity.get('language')})")

# Expected output (Chinese Search Results):
# 1. [3.3814] 人工智能正在改变技术领域 (Language: chinese)
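
Use the fallback analyzer

Queries in languages without a dedicated analyzer can be processed with the fallback analyzer. A minimal sketch, assuming "default" (the key defined in multi_analyzer_params) is accepted as an analyzer_name; scores depend on the data inserted earlier.

# Switch the query-side analyzer to the fallback defined under "default"
search_params["analyzer_name"] = "default"

fallback_results = client.search(
    collection_name=COLLECTION_NAME,     # Collection to search
    data=["datasets"],                   # Query text in any language
    anns_field="sparse",                 # Field to search against
    search_params=search_params,         # Search configuration
    limit=3,                             # Max results to return
    output_fields=["text", "language"],  # Fields to include in the output
)

for i, hit in enumerate(fallback_results[0]):
    print(f"{i+1}. [{hit['distance']:.4f}] {hit['entity'].get('text')} "
          f"(Language: {hit['entity'].get('language')})")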