
Jieba

The jieba tokenizer processes Chinese text by breaking it down into its component words.

📘 Notes

The jieba tokenizer preserves punctuation marks as separate tokens in the output. For example, "你好!世界。" becomes ["你好", "!", "世界", "。"]. To remove these standalone punctuation tokens, use the removepunct filter.
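For instance, a configuration along the following lines (a sketch; the removepunct filter runs after tokenization) drops the standalone punctuation tokens:

analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["removepunct"],  # remove standalone punctuation tokens produced by jieba
}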

Configuration

# Simple configuration: only specifying the tokenizer name
analyzer_params = {
    "tokenizer": "jieba",  # Uses the default settings: dict=["_default_"], mode="search", hmm=True
}
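In practice, analyzer_params is attached to a VARCHAR field when defining a collection schema. The sketch below is illustrative only: the connection URI and the field names (id, text) are assumptions, and the max_length value is arbitrary.

from pymilvus import MilvusClient, DataType

# Assumed connection URI; replace with your own deployment.
client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,             # enable text analysis on this field
    analyzer_params=analyzer_params,  # the jieba configuration defined above
)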

Examples

Analyzer configuration

analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["结巴分词器"],
        "mode": "exact",
        "hmm": False
    }
}

Expected output

['milvus', '结巴分词器', '中', '文', '测', '试']
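The tokens above correspond to the input text "milvus结巴分词器中文测试" (the tokens joined back together). A minimal verification sketch, assuming a recent pymilvus with the MilvusClient.run_analyzer helper and a Milvus instance reachable at an assumed URI:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed URI

# Run the analyzer directly on a sample string to inspect the tokens it produces.
result = client.run_analyzer(
    "milvus结巴分词器中文测试",
    analyzer_params=analyzer_params,
)
print(result)
# Expected tokens: ['milvus', '结巴分词器', '中', '文', '测', '试']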