Analyzer Overview (Public Preview)
In text processing, an analyzer is a crucial component that converts raw text into a structured, searchable format. Each analyzer typically consists of two core components: a tokenizer and filters. Together, they transform input text into tokens, refine those tokens, and prepare them for efficient indexing and retrieval.
In Zilliz Cloud, analyzers are configured during collection creation when you add VARCHAR fields to the collection schema. Tokens produced by an analyzer can be used to build an index for keyword matching or converted into sparse embeddings for full text search. For more information, refer to Text Match or Full Text Search.
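For instance, once a collection has an analyzer-enabled VARCHAR field with matching turned on, the tokens it produces can be queried through a TEXT_MATCH filter expression. The sketch below is only an illustration: the collection name my_collection and field name title are placeholders, and the field setup itself is shown later on this page.
from pymilvus import MilvusClient
# Assumes a cluster endpoint and an existing collection "my_collection" with an
# analyzer- and match-enabled VARCHAR field "title" (both hypothetical here)
client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")
results = client.query(
    collection_name="my_collection",
    filter="TEXT_MATCH(title, 'vector database')",  # matches entities whose tokens contain any of the query terms
    output_fields=["title"]
)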
The use of analyzers may impact performance:
- Full text search: DataNode and QueryNode channels consume data more slowly because they must wait for tokenization to complete. As a result, newly ingested data takes longer to become available for search.
- Keyword match: Index creation is also slower, since tokenization must finish before an index can be built.
Anatomy of an analyzer
An analyzer in Zilliz Cloud consists of exactly one tokenizer and zero or more filters.
- Tokenizer: The tokenizer breaks input text into discrete units called tokens. These tokens could be words or phrases, depending on the tokenizer type.
- Filters: Filters can be applied to tokens to further refine them, for example by making them lowercase or removing common words.
Tokenizers support only UTF-8 encoded text. Support for other formats will be added in future releases.
The workflow below shows how an analyzer processes text.
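For example, here is a minimal sketch that traces a sample sentence through a pipeline built from the standard tokenizer plus the lowercase and stop filters described later on this page (the stop word list is illustrative):
analyzer_params = {
    "tokenizer": "standard",                     # 1. split the text into tokens
    "filter": [
        "lowercase",                             # 2. lowercase each token
        {"type": "stop", "stop_words": ["the"]}  # 3. drop the listed common words
    ]
}
# Input text:        "The Milvus Vector Database"
# After tokenizer:   ["The", "Milvus", "Vector", "Database"]
# After lowercase:   ["the", "milvus", "vector", "database"]
# After stop filter: ["milvus", "vector", "database"]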
Analyzer types
Zilliz Cloud provides two types of analyzers to meet different text processing needs:
- Built-in analyzer: These are predefined configurations that cover common text processing tasks with minimal setup. Built-in analyzers are ideal for general-purpose searches, as they require no complex configuration.
- Custom analyzer: For more advanced requirements, custom analyzers allow you to define your own configuration by specifying a tokenizer and zero or more filters. This level of customization is especially useful for specialized use cases where precise control over text processing is needed.
If you omit analyzer configurations during collection creation, Zilliz Cloud uses the standard analyzer for all text processing by default. For details, refer to Standard.
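As a minimal sketch of that default behavior (assuming a schema object and imports set up as in the example at the end of this page), enabling an analyzer on a VARCHAR field without passing analyzer_params applies the standard analyzer:
# No analyzer_params given, so the default standard analyzer is applied to this field
schema.add_field(
    field_name="title",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True
)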
Built-in analyzer
Built-in analyzers in Zilliz Cloud clusters are pre-configured with specific tokenizers and filters, allowing you to use them immediately without needing to define these components yourself. Each built-in analyzer serves as a template that includes a preset tokenizer and filters, with optional parameters for customization.
For example, to use the standard built-in analyzer, simply specify its name standard as the type and optionally include extra configurations specific to this analyzer type, such as stop_words:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"type": "standard", # Uses the standard built-in analyzer
"stop_words": ["a", "an", "for"] # Defines a list of common words (stop words) to exclude from tokenization
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "standard");
analyzerParams.put("stop_words", Arrays.asList("a", "an", "for"));
const analyzer_params = {
"type": "standard", // Uses the standard built-in analyzer
"stop_words": ["a", "an", "for"] // Defines a list of common words (stop words) to exclude from tokenization
};
// go
export analyzerParams='{
"type": "standard",
"stop_words": ["a", "an", "for"]
}'
The configuration of the standard built-in analyzer above is equivalent to setting up a custom analyzer with the following parameters, where the tokenizer and filter options are explicitly defined to achieve similar functionality:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {
            "type": "stop",
            "stop_words": ["a", "an", "for"]
        }
    ]
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter",
    Arrays.asList("lowercase",
        new HashMap<String, Object>() {{
            put("type", "stop");
            put("stop_words", Arrays.asList("a", "an", "for"));
        }}));
const analyzer_params = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {
            "type": "stop",
            "stop_words": ["a", "an", "for"]
        }
    ]
};
// go
export analyzerParams='{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {
            "type": "stop",
            "stop_words": ["a", "an", "for"]
        }
    ]
}'
Zilliz Cloud offers the following built-in analyzers, each designed for specific text processing needs:
- standard: Suitable for general-purpose text processing, applying standard tokenization and lowercase filtering.
- english: Optimized for English-language text, with support for English stop words.
- chinese: Specialized for processing Chinese text, including tokenization adapted for Chinese language structures.
For a list of built-in analyzers and their customizable settings, refer to Built-in Analyzer Reference.
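To get a feel for how the choice of built-in analyzer affects the output, you can run the same sentence through different analyzers with the run_analyzer method (introduced in the example later on this page, assuming a connected MilvusClient) and compare the tokens; the english analyzer additionally removes English stop words and stems tokens:
sample_text = "The Milvus vector database is built for scale."
# Standard analyzer: standard tokenization plus lowercasing
print(client.run_analyzer(sample_text, {"type": "standard"}))
# English analyzer: also removes English stop words and applies stemming
print(client.run_analyzer(sample_text, {"type": "english"}))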
Custom analyzer
For more advanced text processing, custom analyzers in Zilliz Cloud allow you to build a tailored text-handling pipeline by specifying both a tokenizer and filters. This setup is ideal for specialized use cases where precise control is required.
Tokenizer
The tokenizer is a mandatory component of a custom analyzer; it initiates the analyzer pipeline by breaking input text into discrete units, or tokens. Tokenization follows specific rules depending on the tokenizer type, such as splitting on whitespace or punctuation. This process allows for more precise and independent handling of each word or phrase.
For example, a tokenizer would convert the text "Vector Database Built for Scale" into separate tokens:
["Vector", "Database", "Built", "for", "Scale"]
Example of specifying a tokenizer:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"tokenizer": "whitespace",
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "whitespace");
const analyzer_params = {
"tokenizer": "whitespace",
};
// go
export analyzerParams='{
    "tokenizer": "whitespace"
}'
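As a quick sanity check (assuming a connected MilvusClient and the run_analyzer method shown later on this page), the whitespace tokenizer splits only on spaces and leaves case and punctuation untouched:
# Whitespace tokenizer: split on spaces only; no lowercasing, punctuation kept
tokens = client.run_analyzer(
    "Vector Database Built for Scale!",
    {"tokenizer": "whitespace"}
)
print(tokens)
# Roughly: ["Vector", "Database", "Built", "for", "Scale!"]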
For a list of tokenizers available to choose from, refer to Tokenizer Reference.
Filter
Filters are optional components that operate on the tokens produced by the tokenizer, transforming or refining them as needed. For example, after applying a lowercase filter to the tokenized terms ["Vector", "Database", "Built", "for", "Scale"], the result is:
["vector", "database", "built", "for", "scale"]
Filters in a custom analyzer can be either built-in or custom, depending on configuration needs.
- Built-in filters: Pre-configured by Zilliz Cloud, requiring minimal setup. You can use these filters out of the box by specifying their names. The following filters are built in for direct use:
  - lowercase: Converts text to lowercase, ensuring case-insensitive matching. For details, refer to Lowercase.
  - asciifolding: Converts non-ASCII characters to ASCII equivalents, simplifying multilingual text handling. For details, refer to ASCII folding.
  - alphanumonly: Retains only alphanumeric characters by removing others. For details, refer to Alphanumonly.
  - cnalphanumonly: Removes tokens that contain any characters other than Chinese characters, English letters, or digits. For details, refer to Cnalphanumonly.
  - cncharonly: Removes tokens that contain any non-Chinese characters. For details, refer to Cncharonly.
Example of using a built-in filter:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
    "tokenizer": "standard", # Mandatory: Specifies tokenizer
    "filter": ["lowercase"], # Optional: Built-in filter that converts text to lowercase
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter", Collections.singletonList("lowercase"));
const analyzer_params = {
    "tokenizer": "standard", // Mandatory: Specifies tokenizer
    "filter": ["lowercase"], // Optional: Built-in filter that converts text to lowercase
};
// go
export analyzerParams='{
    "tokenizer": "standard",
    "filter": ["lowercase"]
}'
- Custom filters: Custom filters allow for specialized configurations. You can define a custom filter by choosing a valid filter type (filter.type) and adding specific settings for that type. Examples of filter types that support customization:
  - stop: Removes specified common words by setting a list of stop words (e.g., "stop_words": ["of", "to"]). For details, refer to Stop.
  - length: Excludes tokens based on length criteria, such as a maximum token length. For details, refer to Length.
  - stemmer: Reduces words to their root forms for more flexible matching. For details, refer to Stemmer.
Example of configuring a custom filter:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
    "tokenizer": "standard", # Mandatory: Specifies tokenizer
    "filter": [
        {
            "type": "stop", # Specifies 'stop' as the filter type
            "stop_words": ["of", "to"], # Customizes stop words for this filter type
        }
    ]
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter",
    Collections.singletonList(new HashMap<String, Object>() {{
        put("type", "stop");
        put("stop_words", Arrays.asList("of", "to"));
    }}));
const analyzer_params = {
    "tokenizer": "standard", // Mandatory: Specifies tokenizer
    "filter": [
        {
            "type": "stop", // Specifies 'stop' as the filter type
            "stop_words": ["of", "to"], // Customizes stop words for this filter type
        }
    ]
};
// go
export analyzerParams='{
    "tokenizer": "standard",
    "filter": [
        {
            "type": "stop",
            "stop_words": ["of", "to"]
        }
    ]
}'
For a list of available filter types and their specific parameters, refer to Filter Reference.
Example use
In this example, you will create a collection schema that includes:
- A vector field for embeddings.
- Two VARCHAR fields for text processing:
  - One field uses a built-in analyzer.
  - The other uses a custom analyzer.
Before incorporating these configurations into your collection, you'll verify each analyzer using the run_analyzer method.
Step 1: Initialize MilvusClient and create schema
Begin by setting up the Milvus client and creating a new schema.
- Python
- Java
- NodeJS
- Go
- cURL
from pymilvus import MilvusClient, DataType
# Set up a Milvus client
client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")
# Create a new schema
schema = client.create_schema(auto_id=True, enable_dynamic_field=False)
import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.common.DataType;
import io.milvus.v2.common.IndexParam;
import io.milvus.v2.service.collection.request.AddFieldReq;
import io.milvus.v2.service.collection.request.CreateCollectionReq;
// Set up a Milvus client
ConnectConfig config = ConnectConfig.builder()
.uri("YOUR_CLUSTER_ENDPOINT")
.build();
MilvusClientV2 client = new MilvusClientV2(config);
// Create schema
CreateCollectionReq.CollectionSchema schema = CreateCollectionReq.CollectionSchema.builder()
.enableDynamicField(false)
.build();
import { MilvusClient, DataType } from "@zilliz/milvus2-sdk-node";
// Set up a Milvus client
const client = new MilvusClient("YOUR_CLUSTER_ENDPOINT");
// go
# restful
Step 2: Define and verify analyzer configurations
- Configure and verify a built-in analyzer (english):
  - Configuration: Define the analyzer parameters for the built-in English analyzer.
  - Verification: Use run_analyzer to check that the configuration produces the expected tokenization.
- Python
- Java
- NodeJS
- Go
- cURL
# Built-in analyzer configuration for English text processing
analyzer_params_built_in = {
"type": "english"
}
# Verify built-in analyzer configuration
sample_text_built_in = "Milvus simplifies text analysis for search."
result_built_in = client.run_analyzer(sample_text_built_in, analyzer_params_built_in)
print("Built-in analyzer output:", result_built_in)
# Expected output:
# Built-in analyzer output: ['milvus', 'simplifi', 'text', 'analysi', 'search']
// Add fields to schema
// Use a built-in analyzer
Map<String, Object> analyzerParamsBuiltin = new HashMap<>();
analyzerParamsBuiltin.put("type", "english");
// Add VARCHAR field `title_en`
schema.addField(AddFieldReq.builder()
.fieldName("title_en")
.dataType(DataType.VarChar)
.maxLength(1000)
.enableAnalyzer(true)
.analyzerParams(analyzerParamsBuiltin)
.enableMatch(true)
.build());
// Use a built-in analyzer for VARCHAR field `title_en`
const analyzerParamsBuiltIn = {
type: "english",
};
// go
# restful
- Configure and verify a custom analyzer:
  - Configuration: Define a custom analyzer that uses a standard tokenizer along with a built-in lowercase filter and custom filters for token length and stop words.
  - Verification: Use run_analyzer to ensure the custom configuration processes text as intended.
- Python
- Java
- NodeJS
- Go
- cURL
# Custom analyzer configuration with a standard tokenizer and custom filters
analyzer_params_custom = {
    "tokenizer": "standard",
    "filter": [
        "lowercase", # Built-in filter: convert tokens to lowercase
        {
            "type": "length", # Custom filter: restrict token length
            "max": 40
        },
        {
            "type": "stop", # Custom filter: remove specified stop words
            "stop_words": ["of", "for"]
        }
    ]
}
# Verify custom analyzer configuration
sample_text_custom = "Milvus provides flexible, customizable analyzers for robust text processing."
result_custom = client.run_analyzer(sample_text_custom, analyzer_params_custom)
print("Custom analyzer output:", result_custom)
# Expected output:
# Custom analyzer output: ['milvus', 'provides', 'flexible', 'customizable', 'analyzers', 'robust', 'text', 'processing']
// Configure a custom analyzer
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter",
    Arrays.asList("lowercase",
        new HashMap<String, Object>() {{
            put("type", "length");
            put("max", 40);
        }},
        new HashMap<String, Object>() {{
            put("type", "stop");
            put("stop_words", Arrays.asList("of", "for"));
        }}
    )
);
// Configure a custom analyzer for VARCHAR field `title`
const analyzerParamsCustom = {
    tokenizer: "standard",
    filter: [
        "lowercase",
        {
            type: "length",
            max: 40,
        },
        {
            type: "stop",
            stop_words: ["of", "for"],
        },
    ],
};
// go
# curl
Step 3: Add fields to the schema
Now that you have verified your analyzer configurations, add them to your schema fields:
- Python
- Java
- NodeJS
- Go
- cURL
# Add VARCHAR field 'title_en' using the built-in analyzer configuration
schema.add_field(
    field_name='title_en',
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,
    analyzer_params=analyzer_params_built_in,
    enable_match=True,
)
# Add VARCHAR field 'title' using the custom analyzer configuration
schema.add_field(
    field_name='title',
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,
    analyzer_params=analyzer_params_custom,
    enable_match=True,
)
# Add a vector field for embeddings
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=3)
# Add a primary key field
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.addField(AddFieldReq.builder()
.fieldName("title")
.dataType(DataType.VarChar)
.maxLength(1000)
.enableAnalyzer(true)
.analyzerParams(analyzerParams)
.enableMatch(true) // must enable this if you use TextMatch
.build());
// Add vector field
schema.addField(AddFieldReq.builder()
.fieldName("embedding")
.dataType(DataType.FloatVector)
.dimension(3)
.build());
// Add primary field
schema.addField(AddFieldReq.builder()
.fieldName("id")
.dataType(DataType.Int64)
.isPrimaryKey(true)
.autoID(true)
.build());
// Create schema
const schema = {
    auto_id: true,
    fields: [
        {
            name: "id",
            type: DataType.INT64,
            is_primary: true,
        },
        {
            name: "title_en",
            data_type: DataType.VARCHAR,
            max_length: 1000,
            enable_analyzer: true,
            analyzer_params: analyzerParamsBuiltIn,
            enable_match: true,
        },
        {
            name: "title",
            data_type: DataType.VARCHAR,
            max_length: 1000,
            enable_analyzer: true,
            analyzer_params: analyzerParamsCustom,
            enable_match: true,
        },
        {
            name: "embedding",
            data_type: DataType.FLOAT_VECTOR,
            dim: 3,
        },
    ],
};
// go
# restful
Step 4: Prepare index parameters and create the collection
- Python
- Java
- NodeJS
- Go
- cURL
# Set up index parameters for the vector field
index_params = client.prepare_index_params()
index_params.add_index(field_name="embedding", metric_type="COSINE", index_type="AUTOINDEX")
# Create the collection with the defined schema and index parameters
client.create_collection(
collection_name="YOUR_COLLECTION_NAME",
schema=schema,
index_params=index_params
)
// Set up index params for vector field
List<IndexParam> indexes = new ArrayList<>();
indexes.add(IndexParam.builder()
.fieldName("embedding")
.indexType(IndexParam.IndexType.AUTOINDEX)
.metricType(IndexParam.MetricType.COSINE)
.build());
// Create collection with defined schema
CreateCollectionReq requestCreate = CreateCollectionReq.builder()
.collectionName("YOUR_COLLECTION_NAME")
.collectionSchema(schema)
.indexParams(indexes)
.build();
client.createCollection(requestCreate);
// Set up index params for vector field
const indexParams = [
{
name: "embedding",
metric_type: "COSINE",
index_type: "AUTOINDEX",
},
];
// Create collection with defined schema
await client.createCollection({
collection_name: "YOUR_COLLECTION_NAME",
schema: schema,
index_params: indexParams,
});
console.log("Collection created successfully!");
// go
# restful