Standard Analyzer (Public Preview)
The standard analyzer is the default analyzer in Zilliz Cloud and is automatically applied to text fields if no analyzer is specified. It uses grammar-based tokenization, making it effective for most languages.
Definition
The standard analyzer consists of:
- Tokenizer: Uses the standard tokenizer to split text into discrete word units based on grammar rules. For more information, refer to Standard.
- Filter: Uses the lowercase filter to convert all tokens to lowercase, enabling case-insensitive searches. For more information, refer to Lowercase.
The functionality of the standard analyzer is equivalent to the following custom analyzer configuration:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"tokenizer": "standard",
"filter": ["lowercase"]
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter", Collections.singletonList("lowercase"));
const analyzer_params = {
"tokenizer": "standard",
"filter": ["lowercase"]
};
// go
# restful
Configuration
To apply the standard analyzer to a field, set type to standard in analyzer_params, and include optional parameters as needed.
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"type": "standard", # Specifies the standard analyzer type
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "standard");
const analyzer_params = {
"type": "standard", // Specifies the standard analyzer type
};
// go
# restful
The standard analyzer accepts the following optional parameters:
| Parameter | Description |
| --- | --- |
| stop_words | An array containing a list of stop words, which will be removed from tokenization. Defaults to _english_. |
Example configuration of custom stop words:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"type": "standard", # Specifies the standard analyzer type
"stop_words": ["of"] # Optional: List of words to exclude from tokenization
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "standard");
analyzerParams.put("stop_words", Collections.singletonList("of"));
const analyzer_params = {
"type": "standard", // Specifies the standard analyzer type
"stop_words": ["of"] // Optional: List of words to exclude from tokenization
};
// go
# restful
After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. This allows Zilliz Cloud to process the text in that field using the specified analyzer for efficient tokenization and filtering. For more information, refer to Example use.
Examples
Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.
Analyzer configuration:
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"type": "standard", # Standard analyzer configuration
"stop_words": ["for"] # Optional: Custom stop words parameter
}
// java
// javascript
// go
# restful
Verification using run_analyzer:
- Python
- Java
- NodeJS
- Go
- cURL
# Sample text to analyze
sample_text = "The Milvus vector database is built for scale!"
# Connect to your cluster (replace with your own endpoint and credentials)
client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT", token="YOUR_API_KEY")
# Run the standard analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Standard analyzer output:", result)
// java
// javascript
// go
# restful
Expected output:
Standard analyzer output: ['the', 'milvus', 'vector', 'database', 'is', 'built', 'scale']
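As a rough sketch of why the output looks this way, the pipeline above (grammar-based word splitting, lowercasing, then removal of the custom stop word "for") can be approximated in plain Python. This is not Milvus's actual tokenizer, which follows Unicode text segmentation rules; a simple regex split is a stand-in:

```python
import re

def approximate_standard_analyzer(text, stop_words=()):
    """Rough approximation of the standard analyzer pipeline:
    split into word units, lowercase, then drop stop words."""
    tokens = re.findall(r"\w+", text)        # crude grammar-based split
    tokens = [t.lower() for t in tokens]     # lowercase filter
    stops = set(stop_words)
    return [t for t in tokens if t not in stops]

print(approximate_standard_analyzer(
    "The Milvus vector database is built for scale!",
    stop_words=["for"],
))
# → ['the', 'milvus', 'vector', 'database', 'is', 'built', 'scale']
```

Note that "the" and "is" survive because the custom stop_words list replaces the defaults rather than extending them.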