# Chinese

The `chinese` analyzer is designed specifically to handle Chinese text, providing effective segmentation and tokenization.
## Definition

The `chinese` analyzer consists of:

- **Tokenizer**: Uses the `jieba` tokenizer to segment Chinese text into tokens based on vocabulary and context. For more information, refer to Jieba.
- **Filter**: Uses the `cnalphanumonly` filter to remove tokens that contain characters other than Chinese characters, English letters, and digits. For more information, refer to Cnalphanumonly.
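As an illustration of the filtering step, the following plain-Python sketch keeps a token only if every character is a CJK ideograph, an ASCII letter, or a digit. This is illustrative only, not the Milvus implementation of `cnalphanumonly`:

```python
import re

# Sketch of cnalphanumonly-style filtering (illustrative, not Milvus internals):
# keep a token only if every character is a CJK ideograph, an ASCII letter, or a digit.
_CN_ALNUM = re.compile(r"[\u4e00-\u9fff0-9a-zA-Z]+")

def cn_alnum_only(tokens):
    return [t for t in tokens if _CN_ALNUM.fullmatch(t)]

print(cn_alnum_only(["Milvus", "向量", "、", "数据库", "!"]))
# -> ['Milvus', '向量', '数据库']
```

Note that pure-ASCII tokens such as `Milvus` survive the filter, which is consistent with the expected output shown later on this page.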
The functionality of the `chinese` analyzer is equivalent to the following custom analyzer configuration:
**Python**

```python
analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
}
```

**Java**

```java
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");
analyzerParams.put("filter", Collections.singletonList("cnalphanumonly"));
```

**NodeJS**

```javascript
const analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
};
```

**Go**

```go
analyzerParams = map[string]any{"tokenizer": "jieba", "filter": []any{"cnalphanumonly"}}
```

**cURL**

```bash
# restful
analyzerParams='{
    "tokenizer": "jieba",
    "filter": [
        "cnalphanumonly"
    ]
}'
```
## Configuration

To apply the `chinese` analyzer to a field, simply set `type` to `chinese` in `analyzer_params`.
**Python**

```python
analyzer_params = {
    "type": "chinese",
}
```

**Java**

```java
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "chinese");
```

**NodeJS**

```javascript
const analyzer_params = {
    "type": "chinese",
};
```

**Go**

```go
analyzerParams = map[string]any{"type": "chinese"}
```

**cURL**

```bash
# restful
analyzerParams='{
    "type": "chinese"
}'
```
📘 **Notes**

The `chinese` analyzer does not accept any optional parameters.
## Examples

### Analyzer configuration

**Python**

```python
analyzer_params = {
    "type": "chinese",
}
```

**Java**

```java
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("type", "chinese");
```

**NodeJS**

```javascript
const analyzer_params = {
    "type": "chinese",
};
```

**Go**

```go
analyzerParams = map[string]any{"type": "chinese"}
```

**cURL**

```bash
# restful
analyzerParams='{
    "type": "chinese"
}'
```
### Expected output

```text
Chinese analyzer output: ['Milvus', '是', '一个', '高性', '性能', '高性能', '可', '扩展', '的', '向量', '数据', '据库', '数据库']
```
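The overlapping tokens in the output (e.g. '高性', '性能', '高性能') come from jieba's search mode, which emits in-vocabulary sub-words alongside each segmented word. A rough sketch of that expansion in plain Python, using a hypothetical mini-vocabulary (jieba uses its full dictionary and also considers 3-character sub-words):

```python
# Hypothetical mini-vocabulary for illustration; jieba uses its full dictionary.
VOCAB = {"高性", "性能", "高性能", "数据", "据库", "数据库", "向量", "扩展"}

def search_mode(words):
    # Mimic jieba's search mode: for each word longer than two characters,
    # first emit its in-vocabulary 2-character sub-words, then the word itself.
    out = []
    for w in words:
        if len(w) > 2:
            for i in range(len(w) - 1):
                sub = w[i:i + 2]
                if sub in VOCAB:
                    out.append(sub)
        out.append(w)
    return out

print(search_mode(["高性能", "数据库"]))
# -> ['高性', '性能', '高性能', '数据', '据库', '数据库']
```

This is why search-mode tokenization improves recall for substring queries: a search for '性能' can match a document that only contains the longer word '高性能'.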