Tokenizer Reference
This section provides a detailed reference for tokenizers.
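All tokenizers in this reference are selected through the `analyzer_params` setting of a text field. The sketch below is a minimal orientation only, assuming the `pymilvus` client: it enables the analyzer on a `VARCHAR` field and names a tokenizer; the endpoint and token are placeholders.

```python
# Minimal sketch (pymilvus client assumed): attach a tokenizer to a text field
# by enabling the analyzer and passing analyzer_params on that field.
from pymilvus import DataType, MilvusClient

client = MilvusClient(
    uri="YOUR_CLUSTER_ENDPOINT",  # placeholder
    token="YOUR_API_KEY",         # placeholder
)

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(
    field_name="text",
    datatype=DataType.VARCHAR,
    max_length=8192,
    enable_analyzer=True,                       # turn on text analysis for this field
    analyzer_params={"tokenizer": "standard"},  # any tokenizer from this reference
)
```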
Standard Tokenizer
The `standard` tokenizer in Zilliz Cloud splits text based on spaces and punctuation marks, making it suitable for most languages.
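As a sketch, assuming the `analyzer_params` form shown above, the `standard` tokenizer is selected by name and can be paired with an optional token filter:

```python
# Standard tokenizer: splits on spaces and punctuation marks.
analyzer_params = {
    "tokenizer": "standard",
    "filter": ["lowercase"],  # optional filter list (assumed); lowercases each token
}
```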
Whitespace
The `whitespace` tokenizer divides text into terms whenever there is a space between words.
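A minimal sketch, using the same assumed `analyzer_params` form:

```python
# Whitespace tokenizer: splits only on spaces, so punctuation stays attached
# to the neighboring word (e.g. "world!" remains a single token).
analyzer_params = {"tokenizer": "whitespace"}
```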
Jieba
The `jieba` tokenizer processes Chinese text by breaking it down into its component words.
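A sketch of both forms, assuming the `analyzer_params` convention above; the keys in the detailed form (`dict`, `mode`, `hmm`) are assumptions drawn from jieba's own options and should be verified against the jieba reference page.

```python
# Jieba tokenizer for Chinese; the plain string form uses default settings.
analyzer_params = {"tokenizer": "jieba"}

# Assumed detailed form: custom dictionary entries, cutting mode, and HMM for
# out-of-vocabulary words.
analyzer_params = {
    "tokenizer": {
        "type": "jieba",
        "dict": ["向量数据库"],  # illustrative user-dictionary entry ("vector database")
        "mode": "search",
        "hmm": True,
    }
}
```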
Lindera
The `lindera` tokenizer performs dictionary-based morphological analysis. It is a good choice for languages—such as Japanese, Korean, and Chinese—whose words are not separated by spaces.
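A sketch under the same assumptions; the `dict_kind` key and its values are assumptions modeled on Lindera's bundled dictionaries and should be checked against the lindera reference page.

```python
# Lindera tokenizer: dictionary-based morphological analysis.
analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ipadic",  # assumed: "ipadic" (Japanese); "ko-dic" / "cc-cedict" for Korean / Chinese
    }
}
```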
ICU
The `icu` tokenizer is built on the International Components for Unicode (ICU) open-source project, which provides key tools for software internationalization. Using ICU's word-break algorithm, the tokenizer accurately splits text into words across most of the world's languages.
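A minimal sketch, again assuming the `analyzer_params` form shown earlier:

```python
# ICU tokenizer: Unicode word-break rules; selected by name, no extra options.
analyzer_params = {"tokenizer": "icu"}
```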
Language Identifier
The `languageidentifier` is a specialized tokenizer that enhances the text search capabilities of Zilliz Cloud by automating language analysis. Its primary function is to detect the language of a text field and then dynamically apply a pre-configured analyzer best suited to that language. This is particularly valuable for applications that handle multiple languages, as it eliminates the need to assign a language manually for each input.
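The sketch below is hypothetical: it illustrates the routing idea (detect the language, then apply a matching analyzer), but the key names and nesting are assumptions, and the exact configuration schema should be taken from the language identifier page.

```python
# Hypothetical sketch: route each input to a pre-configured analyzer based on
# the detected language. Key names below are assumptions, not the documented schema.
analyzer_params = {
    "tokenizer": {
        "type": "languageidentifier",
        "analyzers": {
            "default": {"tokenizer": "icu"},  # fallback when detection is inconclusive
            "English": {"type": "english"},   # assumed built-in English analyzer
            "Chinese": {"type": "chinese"},   # assumed built-in Chinese analyzer
        },
    }
}
```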