Stop

The stop filter removes specified stop words from tokenized text, which helps eliminate common words that carry little meaning. You configure the list of stop words with the stop_words parameter.

Configuration

The stop filter can take its stop-word list either inline through the stop_words parameter or from a registered file resource through the stop_words_file parameter.

Inline stop-words list

To use the stop filter with an inline list, specify "type": "stop" in the filter configuration and set the stop_words parameter to the list of stop words.

analyzer_params = {
    "tokenizer": "standard",
    "filter":[{
        "type": "stop", # Specifies the filter type as stop
        "stop_words": ["of", "to", "_english_"], # Defines custom stop words and includes the English stop word list
    }],
}

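import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

Map<String, Object> analyzerParams = new HashMap<>();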
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter",
        Collections.singletonList(
                new HashMap<String, Object>() {{
                    put("type", "stop");
                    put("stop_words", Arrays.asList("of", "to", "_english_"));
                }}
        )
);

const analyzer_params = {
    "tokenizer": "standard",
    "filter":[{
        "type": "stop", // Specifies the filter type as stop
        "stop_words": ["of", "to", "_english_"], // Defines custom stop words and includes the English stop word list
    }],
};

analyzerParams := map[string]any{"tokenizer": "standard",
    "filter": []any{map[string]any{
        "type":       "stop",
        "stop_words": []string{"of", "to", "_english_"},
    }}}

# restful
analyzerParams='{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stop_words": [
        "of",
        "to",
        "_english_"
      ]
    }
  ]
}'

The stop filter accepts the following configurable parameters.

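Parameter	Description
`stop_words`	A list of words to remove from the tokenized output. By default, the filter uses the built-in `_english_` dictionary. You can override or extend it in three ways. Built-in dictionaries: supply one of these language aliases to use a predefined dictionary: `"_english_"`, `"_danish_"`, `"_dutch_"`, `"_finnish_"`, `"_french_"`, `"_german_"`, `"_hungarian_"`, `"_italian_"`, `"_norwegian_"`, `"_portuguese_"`, `"_russian_"`, `"_spanish_"`, `"_swedish_"`. Custom list: pass an array of your own terms (for example, `["foo", "bar", "baz"]`). Mixed list: combine aliases and custom terms (for example, `["of", "to", "_english_"]`). For the exact contents of each predefined dictionary, refer to stop_words.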
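The three ways of populating stop_words can be pictured as a simple expansion step. The sketch below is only an illustration in plain Python, not the filter's actual implementation: `BUILTIN_DICTIONARIES` and `resolve_stop_words` are hypothetical names, the dictionaries are tiny made-up subsets, and the aliases are written with underscores as in the code samples on this page.

```python
# Hypothetical, heavily truncated stand-ins for the built-in dictionaries.
# The real dictionaries are larger; these entries are for illustration only.
BUILTIN_DICTIONARIES = {
    "_english_": {"a", "an", "and", "for", "of", "the", "to"},
    "_french_": {"de", "et", "la", "le", "les"},
}


def resolve_stop_words(stop_words):
    """Expand language aliases and collect custom terms into one set."""
    resolved = set()
    for entry in stop_words:
        if entry in BUILTIN_DICTIONARIES:
            # Alias: pull in every word of the predefined dictionary.
            resolved |= BUILTIN_DICTIONARIES[entry]
        else:
            # Anything else is treated as a literal custom stop word.
            resolved.add(entry)
    return resolved


builtin_only = resolve_stop_words(["_french_"])
custom_only = resolve_stop_words(["foo", "bar", "baz"])
mixed = resolve_stop_words(["of", "to", "_english_"])

print(sorted(builtin_only))
print(sorted(custom_only))
print(sorted(mixed))
```

In the mixed case, "of" and "to" already appear in the illustrative English subset, so listing them explicitly changes nothing here; with a real dictionary, the custom terms matter only when the dictionary does not already contain them.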


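The stop filter operates on the terms produced by a tokenizer, so it must be used together with one. For the list of tokenizers available in Zilliz Cloud, refer to the tokenizer reference.

After defining analyzer_params, you can apply them to a VARCHAR field when you define a collection schema. Zilliz Cloud then processes the text in that field with the specified analyzer, tokenizing and filtering it efficiently. For details, refer to Example Use.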

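Load stop words from a file resource

For large custom stop-word lists (language-specific lists, domain vocabularies, or lists you want to share across many collections), store the words in a file, register that file as a remote file resource, and then reference it from the filter through the stop_words_file parameter. You can use stop_words_file on its own or together with an inline stop_words list. When both are set, the filter merges the two sources into a single stop-word list.

The file must be UTF-8 plain text with one stop word per line. For example: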

the
of
for
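To make the file format and the merge behavior concrete, here is a minimal local sketch in plain Python. It is not how the service loads the file: `load_stop_words` and `merge_stop_words` are hypothetical helpers, and skipping blank lines and dropping duplicates are assumptions made for the illustration.

```python
import os
import tempfile


def load_stop_words(path):
    """Read a UTF-8 file with one stop word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def merge_stop_words(inline_words, file_words):
    """Combine both sources into one list, dropping duplicates but keeping order."""
    return list(dict.fromkeys([*inline_words, *file_words]))


# Write the example file from above to a temporary location and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stop_words.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("the\nof\nfor\n")
    file_words = load_stop_words(path)

# Inline list ("of", "to") merged with the file contents.
merged = merge_stop_words(["of", "to"], file_words)
print(merged)  # ['of', 'to', 'the', 'for']
```

Checking a file locally in this way before uploading it can catch encoding problems or stray whitespace early.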

Upload the file to the object store that your Milvus cluster is configured to use, and then register it.

from pymilvus import MilvusClient

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

# Register the uploaded file under a name you'll reference from analyzer configs.
client.add_file_resource(
    name="en_stop_words",
    path="file/stop_words.txt",    # full S3 object key, including the rootPath prefix
)

Reference the registered resource in the filter through stop_words_file:

analyzer_params = {
    "tokenizer": "standard",
    "filter": [{
        "type": "stop",
        "stop_words_file": {
            "type": "remote",
            "resource_name": "en_stop_words",
            "file_name": "stop_words.txt",
        },
    }],
}

The stop_words_file parameter accepts an object with the following fields.

Field	Description
`type`	The resource type. Use `"remote"` for a file registered via `add_file_resource`.
`resource_name`	The name used when the file was registered with `add_file_resource`.
`file_name`	The filename portion of the registered resource's object-store path (for example, `"stop_words.txt"` if the resource was registered with `path="file/stop_words.txt"`).
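Because file_name must match the filename portion of the registered path, it can be derived from that path rather than typed twice. The following sketch assumes a hypothetical helper, `stop_words_file_ref`, and also shows stop_words_file combined with an inline stop_words list, which the filter merges as described above.

```python
import posixpath


def stop_words_file_ref(resource_name, object_path):
    """Build a stop_words_file object for a resource registered at object_path."""
    return {
        "type": "remote",
        "resource_name": resource_name,
        # Object-store keys use forward slashes, so posixpath is used
        # to extract the filename portion regardless of the local OS.
        "file_name": posixpath.basename(object_path),
    }


analyzer_params = {
    "tokenizer": "standard",
    "filter": [{
        "type": "stop",
        "stop_words": ["of", "to"],  # inline words, merged with the file contents
        "stop_words_file": stop_words_file_ref("en_stop_words", "file/stop_words.txt"),
    }],
}

print(analyzer_params["filter"][0]["stop_words_file"])
```

Using the same path string for both registration and the filter reference keeps the two from drifting apart.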

Examples

Before applying the analyzer configuration to your collection schema, verify its behavior with the run_analyzer method.

Analyzer configuration

analyzer_params = {
    "tokenizer": "standard",
    "filter":[{
        "type": "stop", # Specifies the filter type as stop
        "stop_words": ["of", "to", "_english_"], # Defines custom stop words and includes the English stop word list
    }],
}

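import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

Map<String, Object> analyzerParams = new HashMap<>();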
analyzerParams.put("tokenizer", "standard");
analyzerParams.put("filter",
        Collections.singletonList(
                new HashMap<String, Object>() {{
                    put("type", "stop");
                    put("stop_words", Arrays.asList("of", "to", "_english_"));
                }}
        )
);

analyzerParams := map[string]any{"tokenizer": "standard",
    "filter": []any{map[string]any{
        "type":       "stop",
        "stop_words": []string{"of", "to", "_english_"},
    }}}

Verification using `run_analyzer`

from pymilvus import (
    MilvusClient,
)

client = MilvusClient(uri="YOUR_CLUSTER_ENDPOINT")

# Sample text to analyze
sample_text = "The stop filter allows control over common stop words for text processing."

# Run the analyzer with the configuration defined above
result = client.run_analyzer(sample_text, analyzer_params)
print("Analyzer output:", result)

import java.util.ArrayList;
import java.util.List;

import io.milvus.v2.client.ConnectConfig;
import io.milvus.v2.client.MilvusClientV2;
import io.milvus.v2.service.vector.request.RunAnalyzerReq;
import io.milvus.v2.service.vector.response.RunAnalyzerResp;

ConnectConfig config = ConnectConfig.builder()
        .uri("YOUR_CLUSTER_ENDPOINT")
        .build();
MilvusClientV2 client = new MilvusClientV2(config);

List<String> texts = new ArrayList<>();
texts.add("The stop filter allows control over common stop words for text processing.");

RunAnalyzerResp resp = client.runAnalyzer(RunAnalyzerReq.builder()
        .texts(texts)
        .analyzerParams(analyzerParams)
        .build());
List<RunAnalyzerResp.AnalyzerResult> results = resp.getResults();

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/milvus-io/milvus/client/v2/milvusclient"
)

// ctx is used by the client calls below
ctx := context.Background()

client, err := milvusclient.New(ctx, &milvusclient.ClientConfig{
    Address: "YOUR_CLUSTER_ENDPOINT",
    APIKey:  "YOUR_CLUSTER_TOKEN",
})
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

bs, _ := json.Marshal(analyzerParams)
texts := []string{"The stop filter allows control over common stop words for text processing."}
option := milvusclient.NewRunAnalyzerOption(texts).
    WithAnalyzerParams(string(bs))

result, err := client.RunAnalyzer(ctx, option)
if err != nil {
    fmt.Println(err.Error())
    // handle error
}

Expected output

['The', 'stop', 'filter', 'allows', 'control', 'over', 'common', 'stop', 'words', 'text', 'processing']
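In this output, "for" has been removed while the capitalized "The" remains, which is consistent with the stop words being matched case-sensitively against tokens that have not been lowercased. The sketch below reproduces that result locally under stated assumptions: a regex word split stands in for the standard tokenizer, `ENGLISH_SUBSET` is a small made-up subset rather than the real built-in dictionary, and `simple_tokenize` and `stop_filter` are hypothetical helpers.

```python
import re

# Illustrative subset only; the real built-in English dictionary is larger.
ENGLISH_SUBSET = {"a", "an", "and", "for", "of", "the", "to"}


def simple_tokenize(text):
    """Rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def stop_filter(tokens, stop_words):
    """Drop every token that exactly (case-sensitively) matches a stop word."""
    return [token for token in tokens if token not in stop_words]


sample_text = "The stop filter allows control over common stop words for text processing."
stop_words = {"of", "to"} | ENGLISH_SUBSET

# Case-sensitive filtering: "for" is removed, "The" is kept.
tokens = stop_filter(simple_tokenize(sample_text), stop_words)
print(tokens)

# Lowercasing the tokens first lets "the" match as well.
lowered = stop_filter([t.lower() for t in simple_tokenize(sample_text)], stop_words)
print(lowered)
```

If stop words at the start of a sentence should also be removed, lowercasing the tokens before they reach the stop filter is one way to get there, as the second call suggests.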
