Chinese
The chinese analyzer is designed specifically to handle Chinese text, providing effective segmentation and tokenization.
Definition
The chinese analyzer consists of:
- Tokenizer: Uses the jieba tokenizer to segment Chinese text into tokens based on vocabulary and context. For more information, refer to Jieba.
- Filter: Uses the cnalphanumonly filter to remove tokens that contain any characters other than Chinese characters, English letters, or digits. For more information, refer to Cnalphanumonly.
The functionality of the chinese analyzer is equivalent to the following custom analyzer configuration:
Python:
analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
}
Java:
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");
analyzerParams.put("filter", Collections.singletonList("cnalphanumonly"));
NodeJS:
const analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
};
cURL:
analyzerParams='{
    "tokenizer": "jieba",
    "filter": [
        "cnalphanumonly"
    ]
}'
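To make the filter stage concrete, the following is a rough pure-Python approximation of what a cnalphanumonly-style filter does to a token list. It is only an illustrative sketch, not the Milvus implementation: the helper name keep_cn_alphanum is hypothetical, and the allowed character ranges (CJK Unified Ideographs U+4E00 to U+9FFF plus ASCII letters and digits) are an assumption. The behavior is chosen to match the expected output in the Examples section, where Milvus is kept and punctuation is dropped.

```python
import re

# Assumption: "Chinese" means CJK Unified Ideographs (U+4E00-U+9FFF) and
# "alphanumeric" means ASCII letters and digits. The real filter may use
# different ranges.
_ALLOWED = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]+")


def keep_cn_alphanum(tokens):
    """Hypothetical helper: keep tokens made only of allowed characters."""
    return [t for t in tokens if _ALLOWED.fullmatch(t)]


# Tokens as a tokenizer might emit them, punctuation included.
raw_tokens = ["Milvus", "是", "一个", "、", "向量", "数据库", "!"]
filtered = keep_cn_alphanum(raw_tokens)
print(filtered)
```

A token is dropped as a whole if any single character falls outside the allowed set, so the punctuation tokens disappear while the Chinese and Latin-script tokens pass through unchanged.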
Configuration
To apply the chinese analyzer to a field, simply set type to chinese in analyzer_params.
Python:
analyzer_params = {
    "type": "chinese",
}
Java:
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "chinese");
NodeJS:
const analyzer_params = {
    "type": "chinese",
}
cURL:
analyzerParams='{
    "type": "chinese"
}'
The chinese analyzer does not accept any optional parameters.
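As a minimal sketch of where analyzer_params ends up, the snippet below attaches the chinese analyzer to a VARCHAR field while building a collection schema with pymilvus. The field names id and text and the max_length value are hypothetical choices for illustration, and the sketch assumes a pymilvus version whose add_field accepts enable_analyzer and analyzer_params.

```python
from pymilvus import DataType, MilvusClient

analyzer_params = {"type": "chinese"}

# Build a schema and attach the analyzer to a text field.
schema = MilvusClient.create_schema(auto_id=True, enable_dynamic_field=False)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="text",                # hypothetical field name
    datatype=DataType.VARCHAR,
    max_length=1000,                  # arbitrary illustrative limit
    enable_analyzer=True,             # turn on text analysis for this field
    analyzer_params=analyzer_params,  # use the built-in chinese analyzer
)
```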
Examples
Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.
Analyzer configuration:
Python:
analyzer_params = {
    "type": "chinese",
}
Verification using run_analyzer:
Python:
from pymilvus import MilvusClient
# Assumes a Milvus instance is reachable at this URI; adjust for your deployment
client = MilvusClient(uri="http://localhost:19530")
# Sample text to analyze
sample_text = "Milvus 是一个高性能、可扩展的向量数据库!"
# Run the chinese analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Chinese analyzer output:", result)
Expected output:
Chinese analyzer output: ['Milvus', '是', '一个', '高性', '性能', '高性能', '可', '扩展', '的', '向量', '数据', '据库', '数据库']