
Doc Data

The Zilliz Cloud web UI provides a simplified, intuitive way to create, run, and manage pipelines, while the RESTful API offers more flexibility and customization than the web UI.

This guide walks you through the necessary steps to create doc pipelines, conduct a semantic search on your embedded doc data, and delete the pipeline if it is no longer needed.
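
As the introduction notes, the RESTful API offers more flexibility than the web UI. Each procedure in this guide is therefore followed by a short sketch of its RESTful equivalent, written in Python with the requests library. These sketches are illustrative only: the controller host, endpoint paths, payload field names, and response shapes are assumptions modeled on the Pipelines API conventions and should be verified against the RESTful API reference. A minimal setup, assuming an API key in the ZILLIZ_API_KEY environment variable:

```python
import os

import requests

# Assumed regional controller endpoint for clusters deployed in GCP us-west1;
# verify the exact host and path against the Pipelines RESTful API reference.
BASE_URL = "https://controller.api.gcp-us-west1.zillizcloud.com/v1/pipelines"
HEADERS = {
    "Authorization": f"Bearer {os.environ['ZILLIZ_API_KEY']}",
    "Content-Type": "application/json",
}
```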

Prerequisites and limitations

  • Ensure you have created a cluster deployed in us-west1 on Google Cloud Platform (GCP).

  • In one project, you can create up to 100 pipelines of each type. For more information, refer to Zilliz Cloud Limits.

Ingest doc data

To ingest any data, you need to first create an ingestion pipeline and then run it.

Create doc ingestion pipeline

  1. Navigate to your project.

  2. Click on Pipelines from the navigation panel. Then switch to the Overview tab and click Pipelines. To create a pipeline, click + Pipeline.

    create-pipeline

  3. Choose the type of pipeline to create. Click the + Pipeline button in the Ingestion Pipeline column.

    choose-pipeline

  4. Configure the Ingestion pipeline you wish to create.

    • Target Cluster: The cluster where a new collection will be automatically created with this Ingestion pipeline. Currently, this can only be a cluster deployed on GCP us-west1.

    • Collection Name: The name of the auto-created collection.

    • Pipeline Name: The name of the new Ingestion pipeline. It should only contain lowercase letters, numbers, and underscores.

    • Description (Optional): The description of the new Ingestion pipeline.

    configure-ingestion-pipeline

  5. Add an INDEX function to the Ingestion pipeline by clicking + Function. For each Ingestion pipeline, you can add exactly one INDEX function.

    1. Enter function name.

    2. Select INDEX_DOC as the function type. An INDEX_DOC function splits a doc file (from object storage via a pre-signed URL, or from a local upload) into chunks and generates vector embeddings for those chunks.

    3. Choose the embedding model used to generate vector embeddings. Different document languages have distinct embedding models. Currently, six models are available for the English language: zilliz/bge-base-en-v1.5, voyageai/voyage-2, voyageai/voyage-code-2, voyageai/voyage-large-2, openai/text-embedding-3-small, and openai/text-embedding-3-large. For the Chinese language, only zilliz/bge-base-zh-v1.5 is available. The following table briefly introduces each embedding model.

      • zilliz/bge-base-en-v1.5: Released by BAAI, this state-of-the-art open-source model is hosted on Zilliz Cloud and co-located with the vector databases, providing good quality and the best network latency.

      • voyageai/voyage-2: Hosted by Voyage AI. This general-purpose model excels at retrieving technical documentation containing descriptive text and code. Its lighter version, voyage-lite-02-instruct, ranks at the top of the MTEB leaderboard. This model is only available when the language is ENGLISH.

      • voyageai/voyage-code-2: Hosted by Voyage AI. This model is optimized for software code, providing outstanding quality for retrieving software documents and source code. This model is only available when the language is ENGLISH.

      • voyageai/voyage-large-2: Hosted by Voyage AI. This is the most powerful generalist embedding model from Voyage AI. It supports a 16k context length (4x that of voyage-2) and excels on various types of text, including technical and long-context documents. This model is only available when the language is ENGLISH.

      • openai/text-embedding-3-small: Hosted by OpenAI. This highly efficient embedding model outperforms its predecessor, text-embedding-ada-002, and balances inference cost and quality. This model is only available when the language is ENGLISH.

      • openai/text-embedding-3-large: Hosted by OpenAI. This is OpenAI's best-performing model. Compared to text-embedding-ada-002, the MTEB score has increased from 61.0% to 64.6%. This model is only available when the language is ENGLISH.

      • zilliz/bge-base-zh-v1.5: Released by BAAI, this state-of-the-art open-source model is hosted on Zilliz Cloud and co-located with the vector databases, providing good quality and the best network latency. This is the default embedding model when the language is CHINESE.

      add-index-doc-function

    4. Click Add to save your function.

  6. (Optional) Add a PRESERVE function if you need to preserve the metadata for your docs. A PRESERVE function adds additional scalar fields to the collection along with data ingestion.

    📘Notes

    For each Ingestion pipeline, you can add up to 50 PRESERVE functions.

    1. Click + Function.

    2. Enter function name.

    3. Configure the input field name and type. Supported input field types include Bool, Int8, Int16, Int32, Int64, Float, Double, and VarChar.

      📘Notes
      • Currently, the output field name must be identical to the input field name. The input field name defines the field name used when running the Ingestion pipeline. The output field name defines the field name in the vector collection schema where the preserved value is kept.

      • For VarChar fields, the value should be a string with a maximum length of 4,000 alphanumeric characters.

      • When storing date-time in scalar fields, it is recommended to use the Int16 data type for year data, and Int32 for timestamps.

    4. Click Add to save your function.

      add-preserve-function

  7. Click Create Ingestion Pipeline. (For a RESTful equivalent of this procedure, see the sketch after these steps.)

  8. Continue creating a Search pipeline and a Deletion pipeline that are auto-configured to be compatible with the just-created Ingestion pipeline.

    auto-create-doc-search-and-delete-pipelines

    📘Notes

    By default, the reranker feature is disabled in the auto-configured Search pipeline. If you need the reranker, manually create a new Search pipeline.
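
For reference, the following is a hedged sketch of creating the same Ingestion pipeline (one INDEX_DOC function plus one PRESERVE function) via the RESTful API. The projectId and clusterId values are placeholders, and the payload field names are assumptions to be checked against the API reference:

```python
# Continues the setup sketch above (BASE_URL, HEADERS, requests imported).
payload = {
    "projectId": "proj-xxxx",                      # placeholder project ID
    "name": "my_doc_ingestion",
    "description": "Ingest docs into my_collection",
    "type": "INGESTION",
    "clusterId": "inxx-xxxx",                      # cluster in GCP us-west1
    "newCollectionName": "my_collection",          # auto-created collection
    "functions": [
        {
            "name": "index_my_doc",
            "action": "INDEX_DOC",
            "inputField": "doc_url",               # signed URL at run time
            "language": "ENGLISH",
            "embedding": "zilliz/bge-base-en-v1.5",
        },
        {
            "name": "keep_publish_year",
            "action": "PRESERVE",
            "inputField": "publish_year",
            "outputField": "publish_year",         # must match the input field
            "fieldType": "Int16",                  # recommended type for years
        },
    ],
}
resp = requests.post(BASE_URL, headers=HEADERS, json=payload)
resp.raise_for_status()
ingestion_pipeline_id = resp.json()["data"]["pipelineId"]  # assumed shape
```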

Run doc ingestion pipeline

  1. Click the "▶︎" button next to your Ingestion pipeline. Alternatively, you can also click on the Playground tab.

    run-pipeline

  2. Ingest your file. Zilliz Cloud provides two ways to ingest your data. (A RESTful equivalent is sketched after these steps.)

    • To ingest a file from object storage, enter an S3 presigned URL or a GCS signed URL in the doc_url field in the code.

    • To upload a local file, click Attach File. In the dialog that appears, upload your local file. The file must be no larger than 10 MB. Supported file formats include .txt, .pdf, .md, .html, .epub, .csv, .doc, .docx, .xls, .xlsx, .ppt, and .pptx. Once the upload succeeds, click Attach. If you have added a PRESERVE function to this Ingestion pipeline, also configure the data field.
  3. Check the results.

  4. Remove the document to run the pipeline again.
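
A hedged sketch of running the Ingestion pipeline via the RESTful API, assuming the run endpoint and body shape below; the signed URL is a placeholder:

```python
# Continues the sketches above (BASE_URL, HEADERS, ingestion_pipeline_id).
run_body = {
    "data": {
        # An S3 presigned URL or a GCS signed URL (placeholder shown).
        "doc_url": "https://storage.googleapis.com/my-bucket/doc.pdf?X-Goog-Signature=...",
        "publish_year": 2024,  # only if a PRESERVE function was added
    }
}
resp = requests.post(f"{BASE_URL}/{ingestion_pipeline_id}/run",
                     headers=HEADERS, json=run_body)
resp.raise_for_status()
print(resp.json())  # e.g. the number of doc chunks ingested
```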

Search doc data

To search any data, you need to first create a search pipeline and then run it. Unlike Ingestion and Deletion pipelines, when creating a Search pipeline, the cluster and collection are defined at the function level instead of the pipeline level. This is because Zilliz Cloud allows you to search from multiple collections at a time.

Create doc search pipeline

  1. Navigate to your project.

  2. Click on Pipelines from the navigation panel. Then switch to the Overview tab and click Pipelines. To create a pipeline, click + Pipeline.

  3. Choose the type of pipeline to create. Click the + Pipeline button in the Search Pipeline column.

    create-search-pipeline

  4. Configure the Search pipeline you wish to create.

    • Pipeline Name: The name of the new Search pipeline. It should only contain lowercase letters, numbers, and underscores.

    • Description (Optional): The description of the new Search pipeline.

    configure-search-pipeline

  5. Add a function to the Search pipeline by clicking + Function. You can add exactly one function.

    1. Enter function name.

    2. Choose the Target Cluster and Target Collection. The Target Cluster must be a cluster deployed in us-west1 on Google Cloud Platform (GCP), and the Target Collection must have been created by an Ingestion pipeline; otherwise, the Search pipeline will not be compatible.

    3. Select SEARCH_DOC_CHUNK as the Function Type. A SEARCH_DOC_CHUNK function can convert the input query text to a vector embedding and retrieve the topK most relevant doc chunks.

    4. (Optional) Enable reranker if you want to rank the search results based on their relevance to the query to improve search quality. However, note that enabling reranker will lead to higher cost and search latency. By default, this feature is disabled. Once enabled, you can choose the model service used for reranking. Currently, only zilliz/bge-reranker-base is available.

      • zilliz/bge-reranker-base: Open-source cross-encoder architecture reranker model published by BAAI. This model is hosted on Zilliz Cloud.

      add-search-doc-function

    5. Click Add to save your function.

  6. Click Create Search Pipeline. (For a RESTful equivalent of this procedure, see the sketch below.)
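
A hedged sketch of creating the same Search pipeline via the RESTful API. Note that the cluster and collection appear inside the function, matching the function-level scoping described above; the payload field names are assumptions:

```python
# Continues the setup sketch above (BASE_URL, HEADERS, requests imported).
payload = {
    "projectId": "proj-xxxx",                # placeholder project ID
    "name": "my_doc_search",
    "description": "Search doc chunks in my_collection",
    "type": "SEARCH",
    "functions": [
        {
            "name": "search_chunk",
            "action": "SEARCH_DOC_CHUNK",
            "clusterId": "inxx-xxxx",        # set per function, not per pipeline
            "collectionName": "my_collection",
            "embedding": "zilliz/bge-base-en-v1.5",  # must match ingestion
            "reranker": "zilliz/bge-reranker-base",  # optional; omit to disable
        }
    ],
}
resp = requests.post(BASE_URL, headers=HEADERS, json=payload)
resp.raise_for_status()
search_pipeline_id = resp.json()["data"]["pipelineId"]  # assumed shape
```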

Run doc search pipeline

  1. Click the "▶︎" button next to your Search pipeline. Alternatively, you can also click on the Playground tab.

    run-pipeline

  2. Configure the required parameters and click Run. (A RESTful equivalent is sketched after these steps.)

  3. Check the results.

  4. Enter new query text to rerun the pipeline.
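
A hedged sketch of running the Search pipeline via the RESTful API; the params block (limit, offset, filter) is an assumption modeled on common search options:

```python
# Continues the sketches above (BASE_URL, HEADERS, search_pipeline_id).
run_body = {
    "data": {"query_text": "How do I create an Ingestion pipeline?"},
    "params": {"limit": 3, "offset": 0, "outputFields": [], "filter": ""},
}
resp = requests.post(f"{BASE_URL}/{search_pipeline_id}/run",
                     headers=HEADERS, json=run_body)
resp.raise_for_status()
for hit in resp.json()["data"]["result"]:  # assumed response shape
    print(hit)
```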

Delete doc data

To delete any data, you need to first create a deletion pipeline and then run it.

📘Notes

You must create an Ingestion pipeline first. Upon successful creation of an Ingestion pipeline, you can create a Search pipeline and a Deletion pipeline to work with your newly created Ingestion pipeline.

Create doc deletion pipeline

  1. Navigate to your project.

  2. Click on Pipelines from the navigation panel. Then switch to the Overview tab and click Pipelines. To create a pipeline, click + Pipeline.

  3. Choose the type of pipeline to create. Click the + Pipeline button in the Deletion Pipeline column.

    create-deletion-pipeline

  4. Configure the Deletion pipeline you wish to create.

    • Pipeline Name: The name of the new Deletion pipeline. It should only contain lowercase letters, numbers, and underscores.

    • Description (Optional): The description of the new Deletion pipeline.

    configure-deletion-pipeline

  5. Add a function to the Deletion pipeline by clicking + Function. You can add exactly one function.

    1. Enter function name.

    2. Select either PURGE_DOC_INDEX or PURGE_BY_EXPRESSION as the Function Type. A PURGE_DOC_INDEX function deletes all doc chunks with the specified doc_name, while a PURGE_BY_EXPRESSION function deletes all entities matching the specified filter expression.

    3. Click Add to save your function.

  6. Click Create Deletion Pipeline. (For a RESTful equivalent of this procedure, see the sketch below.)
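
A hedged sketch of creating the same Deletion pipeline via the RESTful API, shown with a PURGE_DOC_INDEX function. The cluster and collection are assumed to be set at the pipeline level (as with Ingestion pipelines), and the field names are assumptions:

```python
# Continues the setup sketch above (BASE_URL, HEADERS, requests imported).
payload = {
    "projectId": "proj-xxxx",          # placeholder project ID
    "name": "my_doc_deletion",
    "description": "Delete doc chunks by doc_name",
    "type": "DELETION",
    "clusterId": "inxx-xxxx",          # placeholder cluster ID
    "collectionName": "my_collection",
    "functions": [
        {
            "name": "purge_by_doc_name",
            "action": "PURGE_DOC_INDEX",
            "inputField": "doc_name",  # use PURGE_BY_EXPRESSION with a filter
                                       # expression to delete by expression
        }
    ],
}
resp = requests.post(BASE_URL, headers=HEADERS, json=payload)
resp.raise_for_status()
deletion_pipeline_id = resp.json()["data"]["pipelineId"]  # assumed shape
```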

Run doc deletion pipeline

  1. Click the "▶︎" button next to your Deletion pipeline. Alternatively, you can also click on the Playground tab.

    run-pipeline

  2. Enter the name of the document to delete in the doc_name field. Click Run. (A RESTful equivalent is sketched after these steps.)

  3. Check the results.
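
A hedged sketch of running the Deletion pipeline via the RESTful API:

```python
# Continues the sketches above (BASE_URL, HEADERS, deletion_pipeline_id).
run_body = {"data": {"doc_name": "doc.pdf"}}  # name of the ingested document
resp = requests.post(f"{BASE_URL}/{deletion_pipeline_id}/run",
                     headers=HEADERS, json=run_body)
resp.raise_for_status()
print(resp.json())  # e.g. the number of entities deleted
```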

Manage pipeline

The following operations manage the pipelines created in the preceding steps.

View pipeline

Click Pipelines on the left navigation. Choose the Pipelines tab. You will see all the available pipelines.

view-pipelines-on-web-ui

Click on a specific pipeline to view its detailed information including its basic information, total usage, functions, and related connectors.

view-pipeline-details

📘Notes

Total usage data may be delayed by a few hours due to technical limitations.

You can also check the pipeline activities on the web UI.

view-pipelines-activities-on-web-ui
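
The same information can presumably be retrieved via the RESTful API; a hedged sketch assuming standard REST listing and describing conventions on the same base path:

```python
# Continues the sketches above (BASE_URL, HEADERS, ingestion_pipeline_id).
# List all pipelines in a project (the query parameter name is an assumption).
resp = requests.get(BASE_URL, headers=HEADERS,
                    params={"projectId": "proj-xxxx"})
resp.raise_for_status()
for p in resp.json()["data"]:  # assumed response shape
    print(p)

# Describe a single pipeline (basic information, functions, and so on).
resp = requests.get(f"{BASE_URL}/{ingestion_pipeline_id}", headers=HEADERS)
resp.raise_for_status()
print(resp.json())
```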

Delete pipeline

If you no longer need a pipeline, you can drop it. Note that dropping a pipeline does not remove the auto-created collection into which it ingested data.

🚧Warning
  • Dropped pipelines cannot be recovered. Please proceed with caution.

  • Dropping a data-ingestion pipeline does not affect the collection created along with the pipeline. Your data is safe.

To drop a pipeline on the web UI, click the ... button under the Actions column. Then click Drop.

delete-pipeline
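
A hedged sketch of dropping a pipeline via the RESTful API, assuming a standard DELETE on the pipeline resource:

```python
# Continues the sketches above (BASE_URL, HEADERS, ingestion_pipeline_id).
# Dropping is irreversible; the collection created by an Ingestion pipeline
# is not removed.
resp = requests.delete(f"{BASE_URL}/{ingestion_pipeline_id}", headers=HEADERS)
resp.raise_for_status()
```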