Version: User Guides (BYOC)

Lindera

The lindera tokenizer performs dictionary-based morphological analysis. It is designed for Japanese and Korean—languages where words are not separated by spaces and grammatical markers (particles) attach directly to words.

📘 Notes

For Chinese text: While lindera supports Chinese via the cc-cedict dictionary, we recommend using the jieba tokenizer instead. Jieba is specifically designed for Chinese word segmentation and provides better results.

Overview

Japanese and Korean are agglutinative languages: grammatical markers called particles attach directly to nouns, forming numerous combinations. For example:

| Language | Root word | + Particle | = Combined form | Meaning |
| --- | --- | --- | --- | --- |
| Korean | 서울 (Seoul) | 에서 | 서울에서 | in Seoul |
| Japanese | 東京 (Tokyo) | に | 東京に | to Tokyo |

The lindera tokenizer:

  1. Segments text into individual morphemes (words and particles)

  2. Tags each token with part-of-speech (POS) information from the dictionary

  3. Applies filters to remove unwanted tokens (e.g., particles, punctuation)

This two-stage process—segmentation followed by POS-based filtering—enables precise control over which tokens are indexed for search.
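The filtering stage can be pictured with plain Python (a conceptual sketch, not the lindera API; the tokens and POS codes below follow the Sejong tagset that ko-dic uses):

```python
# Conceptual sketch of POS-based filtering (not the lindera API).
# After segmentation, each morpheme carries a POS code; tokens whose
# code appears in the stop-tag set are dropped before indexing.
tagged = [
    ("서울", "NNP"),   # proper noun: Seoul
    ("에서", "JKB"),   # adverbial case particle: "in"
    ("맛있", "VA"),    # adjective stem: "delicious"
    ("음식", "NNG"),   # common noun: "food"
    ("을", "JKO"),     # object-case particle
]
stop_tags = {"JKB", "JKO"}  # particle tags to exclude

kept = [token for token, tag in tagged if tag not in stop_tags]
# kept == ["서울", "맛있", "음식"]
```

Only content-bearing morphemes survive; the particles are filtered out by their POS codes rather than by matching the token text itself.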

Configuration

To configure an analyzer using the lindera tokenizer, set tokenizer.type to lindera, choose a dictionary with dict_kind, and optionally apply filters.

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [
            {
                "kind": "korean_stop_tags",
                "tags": ["SP", "SSC", "SSO", "SC", "SE", "SF", "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ", "JX", "JC", "UNK", "EP", "ETM"]
            }
        ]
    }
}

`type`

The type of tokenizer. This is fixed to `"lindera"`.

`dict_kind`

The dictionary used for morphological analysis. Possible values:

  • `ko-dic`: Korean morphological dictionary (MeCab ko-dic)

  • `ipadic`: Japanese morphological dictionary (MeCab IPADIC)

`filter`

A list of tokenizer-level filters applied after segmentation. Each filter is an object with:

  • `kind`: The filter type. Supported values:

    • `korean_stop_tags`: Remove tokens matching the specified Korean POS tags.

    • `japanese_stop_tags`: Remove tokens matching the specified Japanese POS tags.

  • `tags`: A list of POS tags to filter out. Tags must match exactly, and the available tags depend on `kind`:

    • For `korean_stop_tags`: use exact tag codes (e.g., `JKS`, `JKO`, `SF`). For the complete list, based on the Sejong tagset, see the Lindera Korean stop tags source.

    • For `japanese_stop_tags`: use exact tag codes (e.g., `助詞,格助詞`, `助詞,係助詞`, `助動詞`). For the complete list (IPADIC), see the Japanese POS tags reference.

After defining analyzer_params, you can apply them to a VARCHAR field when defining a collection schema. This allows Zilliz Cloud to process the text in that field using the specified analyzer for efficient tokenization and filtering. For details, refer to Example use.

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior using the run_analyzer method.

Korean example

from pymilvus import MilvusClient

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ko-dic",
        "filter": [
            {
                "kind": "korean_stop_tags",
                "tags": ["SP", "SSC", "SSO", "SC", "SE", "SF", "JKS", "JKC", "JKG", "JKO", "JKB", "JKV", "JKQ", "JX", "JC", "UNK", "EP", "ETM"]
            }
        ]
    }
}

# Sample Korean text: "서울에서 맛있는 음식을 먹었습니다" (I ate delicious food in Seoul)
sample_text = "서울에서 맛있는 음식을 먹었습니다"

result = client.run_analyzer(sample_text, analyzer_params)
print("Analyzer output:", result)

Expected output:

['서울', '맛있', '음식', '먹', '습니다']

Without korean_stop_tags, the output would include particles like 에서 (in), 는 (topic marker), and 을 (object marker), which are typically not useful for search.

Japanese example

from pymilvus import MilvusClient

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

analyzer_params = {
    "tokenizer": {
        "type": "lindera",
        "dict_kind": "ipadic",
        "filter": [
            {
                "kind": "japanese_stop_tags",
                "tags": ["接続詞", "助詞,格助詞", "助詞,格助詞,一般", "助詞,格助詞,引用", "助詞,格助詞,連語", "助詞,係助詞", "助詞,終助詞", "助詞,接続助詞", "助詞,特殊", "助詞,副助詞", "助詞,副助詞/並立助詞/終助詞", "助詞,連体化", "助詞,副詞化", "助詞,並立助詞", "助動詞", "記号,一般", "記号,読点", "記号,句点", "記号,空白", "記号,括弧閉", "記号,括弧開", "その他,間投", "フィラー", "非言語音"]
            }
        ]
    }
}

# Sample Japanese text: "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です"
sample_text = "東京スカイツリーの最寄り駅はとうきょうスカイツリー駅です"

result = client.run_analyzer(sample_text, analyzer_params)
print("Analyzer output:", result)

Expected output:

['東京', 'スカイ', 'ツリー', '最寄り駅', 'とう', 'きょう', 'スカイ', 'ツリー', '駅']

Without japanese_stop_tags, the output would include particles like の (possessive), は (topic marker), and です (copula).