
Chinese

The chinese analyzer is designed specifically to handle Chinese text, providing effective segmentation and tokenization.

Definition

The chinese analyzer consists of:

  • Tokenizer: Uses the jieba tokenizer to segment Chinese text into tokens based on vocabulary and context. For more information, refer to Jieba.

  • Filter: Uses the cnalphanumonly filter to remove tokens that contain any characters other than Chinese characters, English letters, or digits. For more information, refer to Cnalphanumonly.

The functionality of the chinese analyzer is equivalent to the following custom analyzer configuration:

analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
}

Configuration

To apply the chinese analyzer to a field, set type to chinese in analyzer_params.

analyzer_params = {
    "type": "chinese",
}
📘 Notes

The chinese analyzer does not accept any optional parameters.
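
In practice, analyzer_params takes effect when you attach it to a VARCHAR field that has the analyzer enabled. The snippet below is a minimal sketch, assuming a Milvus deployment reachable at localhost:19530 and the pymilvus MilvusClient schema API; field names and lengths are illustrative.

from pymilvus import MilvusClient, DataType

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Define a schema with a VARCHAR field that uses the chinese analyzer
schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,                   # enable text analysis for this field
    analyzer_params={"type": "chinese"},    # use the built-in chinese analyzer
)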

Examples

Analyzer configuration

analyzer_params = {
    "type": "chinese",
}
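
To reproduce the output shown below, you can run the analyzer directly against sample text. This is a sketch that assumes pymilvus's MilvusClient.run_analyzer method and a sample sentence inferred from the tokens in the expected output; adjust the URI and text for your environment.

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

analyzer_params = {
    "type": "chinese",
}

# Sample text; assumed here based on the tokens in the expected output below
sample_text = "Milvus 是一个高性能、可扩展的向量数据库"

result = client.run_analyzer(sample_text, analyzer_params)
print("Chinese analyzer output:", result)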

Expected output

Chinese analyzer output: ['Milvus', '是', '一个', '高性', '性能', '高性能', '可', '扩展', '的', '向量', '数据', '据库', '数据库']