Hosted Models (Private Preview)
Zilliz Cloud can host embedding and reranking models on Zilliz-managed infrastructure. You can deploy dedicated, fully managed model instances and use them directly from Zilliz Cloud for stable and high-performance inference.
With a managed model instance, you can insert raw data into a collection. Zilliz Cloud automatically generates vector embeddings with the deployed model during ingestion. For semantic search, you only provide the raw query text. Zilliz Cloud uses the same model to create a query vector, compares it with stored vectors, and returns the most relevant results.
The following diagram shows the procedures for using hosted models.

Deploy a model
Currently, Zilliz Cloud supports the following regions, instance types, and models.
If you have specific requirements for hosted models, please contact us.
Supported regions
The model deployment region should be consistent with your cluster region. Available options include:
| Region | Location |
|---|---|
| aws-us-west-2 | Oregon, USA |
Supported instance type
The instance type determines the available compute resources. Available options include:
| Instance Type | Resources |
|---|---|
| g6.xlarge | |
Supported models
Available options include:
| Type | Model | Description |
|---|---|---|
| Embedding | | Lightweight multilingual embedding model for efficient semantic retrieval, code retrieval, classification, and clustering; supports 100+ languages, 32K context, and up to 1024-dimensional embeddings. |
| | | Balanced Qwen3 embedding model for stronger multilingual and cross-lingual retrieval quality while keeping deployment cost lower than the 8B model; supports 32K context and up to 2560-dimensional embeddings. |
| | | Highest-capacity Qwen3 embedding model for accuracy-focused multilingual, long-text, and code retrieval workloads; supports 32K context and up to 4096-dimensional embeddings. |
| | | Compact English BGE embedding model for low-cost, low-latency semantic search and retrieval; uses 384-dimensional embeddings. |
| | | Compact Chinese BGE embedding model for efficient Chinese semantic search and retrieval; uses 512-dimensional embeddings. |
| | | Mid-size English BGE embedding model that balances retrieval quality and efficiency; uses 768-dimensional embeddings. |
| | | Mid-size Chinese BGE embedding model that balances quality and efficiency for Chinese retrieval workloads; uses 768-dimensional embeddings. |
| | | High-quality English BGE embedding model for accuracy-sensitive semantic search, RAG, and retrieval workloads; uses 1024-dimensional embeddings. |
| | | High-quality Chinese BGE embedding model for accuracy-sensitive Chinese semantic search and retrieval; uses 1024-dimensional embeddings. |
| Reranking | | Lightweight English and Chinese cross-encoder reranker for reordering retrieved candidates with fast inference and easy deployment. |
| | | Larger English and Chinese cross-encoder reranker for higher-quality reranking when accuracy matters more than inference cost. |
| | | Lightweight Qwen3 text reranking model for efficient multilingual and code-related retrieval workflows; supports 100+ languages, 32K context, and instruction-aware reranking. |
| | | Balanced Qwen3 reranking model for stronger multilingual, cross-lingual, long-text, and code retrieval quality while keeping deployment cost below the 8B model. |
| | | Highest-capacity Qwen3 reranking model for accuracy-focused retrieval scenarios that need strong multilingual, long-context, and instruction-aware ranking performance. |
| Semantic Highlighter | | Lightweight bilingual semantic highlighting model for RAG and search workflows; identifies English or Chinese text segments that are semantically relevant to a query, helping users highlight useful context and reduce unnecessary tokens before generation. |
Obtain a deployment ID
Zilliz deploys the model using the information you provide. Deployment takes about 15 minutes. When the deployment is ready, Zilliz Cloud Support returns a deployment ID, which you use when creating embedding or reranking functions.
"deploymentId": "68f8889be4b01215a275972a"
Use the deployed model in a function
Once you have the deployment ID, you can create collections that use the deployed model through embedding or reranking functions.
Use an embedding function
- Create a collection with an embedding function.
  - Define at least one VARCHAR field for the raw text.
  - Define at least one vector field for the embedding vectors generated by the model.
  - Set the vector field dimension to match the model’s output dimension.
from pymilvus import DataType, Function, FunctionType

# Assumes milvus_client (a MilvusClient instance) and collection_name are already defined.
schema = milvus_client.create_schema()
schema.add_field("id", DataType.INT64, is_primary=True, auto_id=False)
schema.add_field("document", DataType.VARCHAR, max_length=9000)
# Important: the dimension must be supported by the deployed model.
schema.add_field("dense", DataType.FLOAT_VECTOR, dim=384)

# Define the embedding function
text_embedding_function = Function(
    name="zilliz-bge-small-en-v1.5",
    function_type=FunctionType.TEXTEMBEDDING,
    input_field_names=["document"],  # Scalar field(s) containing text data to embed
    output_field_names="dense",  # Vector field(s) for storing embeddings
    params={
        "provider": "zilliz",
        "model_deployment_id": "...",  # Use the model deployment ID we provide you
        "truncation": True,  # Optional: if true, inputs greater than the max supported input length of the model will be truncated
        "dimension": "384",  # Optional: shorten the output vector dimension, only if supported by the model
    },
)
schema.add_function(text_embedding_function)

index_params = milvus_client.prepare_index_params()
index_params.add_index(
    field_name="dense",
    index_name="dense_index",
    index_type="AUTOINDEX",
    metric_type="IP",
)

ret = milvus_client.create_collection(collection_name, schema=schema, index_params=index_params, consistency_level="Strong")

- Insert raw text data.
Insert only the raw text into the collection. Zilliz Cloud automatically calls the embedding function and populates the vector field.
rows = [
    {"id": 1, "document": "Artificial intelligence was founded as an academic discipline in 1956."},
    {"id": 2, "document": "Alan Turing was the first person to conduct substantial research in AI."},
    {"id": 3, "document": "Born in Maida Vale, London, Turing was raised in southern England."},
]
insert_result = milvus_client.insert(collection_name, rows, progress_bar=True)

- Conduct a similarity search with raw text data.
Provide the query as raw text. Zilliz Cloud generates the query vector using the same model and performs the similarity search.
search_params = {
    "params": {"nprobe": 10},
}
queries = [
    "When was artificial intelligence founded",
    "Where was Alan Turing born?",
]
result = milvus_client.search(
    collection_name,
    data=queries,
    anns_field="dense",
    search_params=search_params,
    limit=3,
    output_fields=["document"],
    consistency_level="Strong",
)
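The dimension requirement from the first step can be checked before the collection is created. The following is a minimal sketch, not part of the Zilliz Cloud or pymilvus API: `check_vector_dim` is a hypothetical helper, and treating any smaller positive dimension as valid for a model that supports shortening is an assumption. Confirm the sizes your deployed model actually supports.

```python
def check_vector_dim(field_dim: int, model_dim: int, can_shorten: bool = False) -> int:
    """Return field_dim if it is compatible with the model, else raise ValueError.

    model_dim is the model's native (maximum) output dimension, for example
    384 for the compact English BGE model listed in the supported models table.
    can_shorten is an assumption flag: set it only for models that support
    reducing the output size through the "dimension" function parameter.
    """
    if field_dim == model_dim:
        return field_dim
    if can_shorten and 0 < field_dim < model_dim:
        return field_dim
    raise ValueError(
        f"vector field dim {field_dim} is not compatible with model dim {model_dim}"
    )


# Matches the example schema: a 384-dimensional field and a 384-dimensional model.
dim = check_vector_dim(384, 384)
```

The returned value can then be passed as `dim` when adding the vector field, so a mismatch is caught before any data is inserted.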
Use a reranking function
You can also configure a reranking function that uses the deployed model to rerank search results.
import numpy as np
from pymilvus import Function, FunctionType

# dim must equal the dimension of the vector field being searched.
rng = np.random.default_rng(seed=19530)
vectors_to_search = rng.random((1, dim))

# Define the reranking function
ranker = Function(
    name="model_rerank_fn",
    input_field_names=["document"],
    function_type=FunctionType.RERANK,
    params={
        "reranker": "model",
        "provider": "zilliz",
        "model_deployment_id": "...",  # Use the model deployment ID we provide you
        # Query text: the number of query strings must exactly match
        # the number of queries in your search operation.
        "queries": ["machine learning for time series"] * len(vectors_to_search),
    },
)

# Use it during search
result = milvus_client.search(collection_name, vectors_to_search, limit=3, output_fields=["*"], ranker=ranker)
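Because the number of query strings must match the number of search vectors, it can be convenient to validate the count before building the function. This is a small sketch under that assumption only: `build_rerank_params` is a hypothetical helper, not a pymilvus function, and it simply assembles the same `params` dictionary shown above.

```python
def build_rerank_params(deployment_id: str, queries: list, num_search_vectors: int) -> dict:
    """Build the params dict for a model reranking function.

    Raises ValueError when the number of query strings differs from the
    number of vectors (or texts) passed to the search call.
    """
    if len(queries) != num_search_vectors:
        raise ValueError(
            f"expected {num_search_vectors} query strings, got {len(queries)}"
        )
    return {
        "reranker": "model",
        "provider": "zilliz",
        "model_deployment_id": deployment_id,
        "queries": list(queries),
    }


# "..." stands in for the deployment ID returned by Zilliz Cloud Support.
params = build_rerank_params("...", ["machine learning for time series"], 1)
```

The resulting dictionary can be passed as `params` when constructing the reranking `Function`.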
Use a semantic highlighter function
During search, you can use a hosted highlighter model to post-process your search results by highlighting text segments that are semantically related to the user's query.
from pymilvus import SemanticHighlighter

# Define the search query
queries = ["When was artificial intelligence founded"]

# Configure semantic highlighter
highlighter = SemanticHighlighter(
    queries,
    ["document"],  # Fields to highlight
    pre_tags=["<mark>"],  # Tag before highlighted text
    post_tags=["</mark>"],  # Tag after highlighted text
    model_deployment_id="YOUR_MODEL_ID",  # Deployed highlight model ID
)

# Perform search with highlighting
results = milvus_client.search(
    collection_name,
    data=queries,
    anns_field="dense",
    search_params={"params": {"nprobe": 10}},
    limit=3,
    output_fields=["document"],
    highlighter=highlighter,
)

# Process results
for hits in results:
    for hit in hits:
        highlight = hit.get("highlight", {}).get("document", {})
        print(f"ID: {hit['id']}")
        print(f"Search Score: {hit['distance']:.4f}")  # Vector similarity score
        print(f"Fragments: {highlight.get('fragments', [])}")
        print(f"Highlight Confidence: {highlight.get('scores', [])}")  # Semantic relevance score
        print()
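To use the highlighted segments as compact context for generation, the fragments can be gathered and the tags removed. The sketch below assumes each hit exposes the same `highlight` → `document` → `fragments` structure read in the loop above; `collect_fragments` and `strip_tags` are hypothetical helpers, and `sample_results` is mock data used in place of a live search response.

```python
def collect_fragments(results, field="document"):
    """Gather highlighted fragments for one field across all hits."""
    fragments = []
    for hits in results:
        for hit in hits:
            highlight = hit.get("highlight", {}).get(field, {})
            fragments.extend(highlight.get("fragments", []))
    return fragments


def strip_tags(fragment, pre_tag="<mark>", post_tag="</mark>"):
    """Remove the highlight tags so the text can be passed on as plain context."""
    return fragment.replace(pre_tag, "").replace(post_tag, "")


# Mock response shaped like the search results above (illustrative values only).
sample_results = [[
    {
        "id": 1,
        "distance": 0.9,
        "highlight": {
            "document": {
                "fragments": ["<mark>Artificial intelligence was founded</mark> as an academic discipline in 1956."],
                "scores": [0.8],
            }
        },
    }
]]

context = "\n".join(strip_tags(f) for f in collect_fragments(sample_results))
```

Hits without a highlight entry contribute nothing, so the resulting context contains only the segments the highlighter returned.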
Billing
Using hosted models incurs only function and model services charges. Because inference runs within Zilliz Cloud, your data does not traverse the public internet, so you do not incur data transfer charges.
For model unit prices by region, please contact sales.
Cost calculation
Function and Model Services Cost = Model Unit Price × Usage Time
- Model Unit Price: For details, contact sales.
- Usage Time: The total time the model deployment is running, measured in hours, regardless of whether the model is actively used.
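As a worked illustration of the formula, the sketch below computes the cost of a deployment that runs continuously for 30 days. The unit price of 1.5 per hour is a placeholder, not a real Zilliz Cloud price, and the helper names are hypothetical. Any rounding or minimum billing increment that Zilliz applies is not modeled here.

```python
from datetime import datetime


def usage_hours(start: datetime, end: datetime) -> float:
    """Hours the deployment was running, whether or not it served requests."""
    return (end - start).total_seconds() / 3600


def hosted_model_cost(unit_price_per_hour: float, hours: float) -> float:
    """Function and Model Services Cost = Model Unit Price x Usage Time."""
    return unit_price_per_hour * hours


# A deployment running from January 1 to January 31: 30 days, or 720 hours.
hours = usage_hours(datetime(2025, 1, 1), datetime(2025, 1, 31))
cost = hosted_model_cost(1.5, hours)  # 1.5 is a placeholder unit price
```

Because usage time counts every hour the deployment is running, an idle deployment over the same period would cost the same amount.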