
Chinese

The chinese analyzer is designed specifically to handle Chinese text, providing effective segmentation and tokenization.

Definition

The chinese analyzer consists of:

  • Tokenizer: Uses the jieba tokenizer to segment Chinese text into tokens based on vocabulary and context. For more information, refer to Jieba.

  • Filter: Uses the cnalphanumonly filter to remove tokens that contain any characters other than Chinese characters, English letters, or digits. For more information, refer to Cnalphanumonly.
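To illustrate the filtering step, the behavior of cnalphanumonly can be approximated in plain Python: keep a token only if every character is a Chinese character, a letter, or a digit. The function name and the character ranges below are assumptions for illustration, not the actual Milvus implementation.

```python
import re

# Rough approximation of the cnalphanumonly filter: a token survives only if
# it consists entirely of CJK ideographs, ASCII letters, or ASCII digits.
# The Unicode range used here is an assumption for illustration.
CN_ALPHANUM = re.compile(r"^[\u4e00-\u9fffA-Za-z0-9]+$")

def cnalphanumonly(tokens):
    return [t for t in tokens if CN_ALPHANUM.fullmatch(t)]

# Tokens as a jieba-style tokenizer might emit them
tokens = ["Milvus", "是", "高性能", "、", "向量数据库", "!"]
print(cnalphanumonly(tokens))  # punctuation-only tokens are dropped
```

Note that alphanumeric tokens such as "Milvus" pass through the filter, which is why they appear in the analyzer output shown later on this page.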

The functionality of the chinese analyzer is equivalent to the following custom analyzer configuration:

analyzer_params = {
    "tokenizer": "jieba",
    "filter": ["cnalphanumonly"]
}

Configuration

To apply the chinese analyzer to a field, simply set type to chinese in analyzer_params.

analyzer_params = {
    "type": "chinese",
}
📘 Notes

The chinese analyzer does not accept any optional parameters.
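Once defined, the analyzer configuration is attached to a VARCHAR field when building the collection schema. The sketch below assumes a pymilvus client; the field names and max_length value are placeholders.

```python
from pymilvus import MilvusClient, DataType

analyzer_params = {
    "type": "chinese",
}

# Build a schema whose text field uses the chinese analyzer.
# Field names and max_length are placeholders for illustration.
schema = MilvusClient.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=1000,
    enable_analyzer=True,            # enable text analysis on this field
    analyzer_params=analyzer_params, # apply the chinese analyzer
)
```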

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.

Analyzer configuration:

analyzer_params = {
    "type": "chinese",
}

Verification using run_analyzer:

from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Sample text to analyze
sample_text = "Milvus 是一个高性能、可扩展的向量数据库!"

# Run the chinese analyzer with the defined configuration
result = client.run_analyzer(sample_text, analyzer_params)
print("Chinese analyzer output:", result)

Expected output:

Chinese analyzer output: ['Milvus', '是', '一个', '高性', '性能', '高性能', '可', '扩展', '的', '向量', '数据', '据库', '数据库']