Jieba
The jieba tokenizer processes Chinese text by breaking it down into its component words.
Configuration
- Python
- Java
- NodeJS
- Go
- cURL
# Simple configuration: only specifying the tokenizer name
analyzer_params = {
"tokenizer": "jieba", # Use the default settings: dict=["_default_"], mode="search", hmm=true
}
Map<String, Object> analyzerParams = new HashMap<>();
analyzerParams.put("tokenizer", "jieba");
const analyzer_params = {
"tokenizer": "jieba",
};
analyzerParams = map[string]any{"tokenizer": "jieba"}
# restful
analyzerParams='{
"tokenizer": "jieba"
}'
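To use this configuration, attach analyzer_params to a VARCHAR field when defining the collection schema. The snippet below is a minimal Python sketch, assuming pymilvus's MilvusClient; the connection URI, field names, max_length, and collection name are illustrative placeholders, not part of the jieba configuration itself.
# Attach the jieba analyzer to a VARCHAR field (illustrative names)
from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus instance

schema = client.create_schema(auto_id=True)
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,            # enable text analysis on this field
    analyzer_params=analyzer_params, # the jieba configuration defined above
)
client.create_collection(collection_name="jieba_demo", schema=schema)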
Examples
Analyzer configuration
- Python
- Java
- NodeJS
- Go
- cURL
analyzer_params = {
"tokenizer": {
"type": "jieba",
"dict": ["结巴分词器"],
"mode": "exact",
"hmm": False
}
}
Map<String, Object> analyzerParams = new HashMap<>();
Map<String, Object> tokenizerParams = new HashMap<>();
tokenizerParams.put("type", "jieba");
tokenizerParams.put("dict", Collections.singletonList("结巴分词器"));
tokenizerParams.put("mode", "exact");
tokenizerParams.put("hmm", false);
analyzerParams.put("tokenizer", tokenizerParams);
// javascript
const analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": false
    }
};
analyzerParams = map[string]any{"tokenizer": map[string]any{
    "type": "jieba",
    "dict": []any{"结巴分词器"},
    "mode": "exact",
    "hmm": false,
}}
# restful
analyzerParams='{
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": false
    }
}'
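Before applying this configuration to a collection, you can inspect the tokenization result directly. The snippet below is a minimal Python sketch, assuming MilvusClient.run_analyzer is available and a Milvus instance runs at localhost:19530; the sample text is chosen to match the expected output that follows.
# Run the analyzer on a sample string (text chosen to match the expected output below)
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local Milvus instance

sample_text = "milvus结巴分词器中文测试"
result = client.run_analyzer(sample_text, analyzer_params)
print(result)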
Expected output
['milvus', '结巴分词器', '中', '文', '测', '试']