Metrics & Alerts Reference
In this reference, you can find descriptions of monitoring metrics for Zilliz Cloud clusters, as well as alert targets that you can set up at organization and project levels.
Cluster metrics
The Metrics tab in the Zilliz Cloud console displays cluster metrics as graphs.
The following table describes each metric and the actions recommended when your cluster's resource usage exceeds a threshold.
Currently, free clusters offer only one metric, CU Capacity. To unlock a range of advanced metrics, upgrade your plan tier.
Metric Name | Unit | Description | Recommended Action |
---|---|---|---|
Pod Resources | |||
CPU Usage | Core | The number of CPU cores used by pods. | Regularly monitor and log resource usage to identify trends and potential bottlenecks. |
Network Inbound Flow | Mbps | The inbound network traffic of pods. | Track and analyze the amount of data received from external sources to monitor network performance and identify potential network congestion or bandwidth issues. |
Network Outbound Flow | Mbps | The outbound network traffic of pods. | Track and analyze the amount of data sent to external sources to monitor network performance and identify potential network congestion or bandwidth issues. |
Resources | |||
CU Computation | % | A measure of the utilized computational power relative to the total computational capacity of the CU. This metric is available only for Dedicated or BYOC clusters. | 70%-80%: Check service status and prepare for scaling up. > 90%: Scale up immediately to avoid service interruption. |
CU Capacity | % | A measure of the used capacity relative to the total capacity of the CU. This metric is available for Free, Dedicated or BYOC clusters. | 70%-80%: Check service status and prepare for scaling up. > 90%: Scale up immediately to avoid service interruption. 100%: When CU capacity reaches 100%, you will be unable to write data into the cluster. Please scale up immediately to avoid service interruption. |
Storage | GB | The total amount of persistent storage consumed by data and indexes. | Configure alerts for monitoring storage usage. |
Performance | |||
QPS/VPS (Read) | QPS/VPS | QPS: The number of read requests (search and query) per second. VPS: The number of read requests (search) on vectors per second. VPS is not available for query requests as query operations do not involve vectors. | Refer to benchmark for system performance monitoring. |
QPS/VPS (Write) | QPS/VPS | QPS: The number of write requests (insert, bulk insert, upsert, and delete) per second. VPS: The number of write requests (insert, bulk insert, upsert, and delete) on vectors per second. | Refer to benchmark for system performance monitoring. |
Latency (Read) | ms | The time elapsed between a client sending a read request (search and query) to a server and the client receiving a response. Selecting Average or P99 from the expanded dropdown menu on the right displays an average or P99 latency. | - |
Latency (Write) | ms | The time elapsed between a client sending a write request (insert, upsert, and delete) to a server and the client receiving a response. Selecting Average or P99 from the expanded dropdown menu on the right displays an average or P99 latency. | - |
Request Failure Rate (Read) | % | The percentage of failed read requests (search and query) in all read requests per second. | Configure alerts to monitor read request failure rate. |
Request Failure Rate (Write) | % | The percentage of failed write requests (insert, bulk insert, upsert, and delete) in all write requests per second. | Configure alerts to monitor write request failure rate. |
Slow Query Count | count/min | The number of slow query operations, including all search and query requests. By default, all requests whose latency exceeds 5 seconds are considered slow queries. This metric type is available only for Dedicated clusters of the Enterprise edition or BYOC clusters. | Identify problematic queries and tune performance by adjusting cluster configuration as necessary. |
Cluster Write Performance Capacity | % | The ratio of the current write rate to the write rate limit. This metric type is available only for Dedicated clusters of the Enterprise edition or BYOC clusters. | If the ratio is too high (over 80% is the suggested threshold), lower the write rate. |
Number of Flush Operations | count/min | The number of flush operations on a cluster. This metric type is available only for Dedicated clusters of the Enterprise edition or BYOC clusters. | Performing flush operations too frequently can negatively impact the overall performance of the cluster. For more information, refer to Zilliz Cloud Limits. |
Data | |||
Collection Count | count | The number of collections created in a cluster. | - |
Entity Count | count | The number of entities inserted into a cluster. Selecting a specific collection from the expanded dropdown menu on the right displays the number of entities at the collection level. | - |
Loaded Entities | count | The number of entities loaded (actively served) by a cluster. Selecting a specific collection from the expanded dropdown menu on the right displays the number of loaded entities at the collection level. This metric is available only for Dedicated or BYOC clusters. | - |
Number of Unloaded Collections | count | The number of unloaded collections in a cluster. This metric type is available only for Dedicated clusters of the Enterprise edition or BYOC clusters. | - |
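The recommended actions for CU Capacity can be expressed as a small lookup. The following is a minimal sketch, not part of any Zilliz Cloud SDK: the function name `cu_capacity_action` and the returned strings are hypothetical. The table lists actions for 70%-80%, above 90%, and 100% only, so treating the whole 70%-90% range as "prepare to scale up" is an assumption of this sketch.

```python
def cu_capacity_action(usage_percent: float) -> str:
    """Map a CU Capacity reading (0-100) to the action suggested in the table.

    Hypothetical helper for illustration only.
    """
    if not 0 <= usage_percent <= 100:
        raise ValueError("usage_percent must be between 0 and 100")
    if usage_percent >= 100:
        # At 100% the cluster no longer accepts writes.
        return "writes blocked: scale up immediately"
    if usage_percent > 90:
        return "scale up immediately"
    if usage_percent >= 70:
        # Assumption: the table covers 70%-80% and >90% only; this sketch
        # applies the 70%-80% advice to everything from 70% through 90%.
        return "check service status and prepare to scale up"
    return "no action needed"


if __name__ == "__main__":
    for reading in (45, 75, 95, 100):
        print(f"{reading}% -> {cu_capacity_action(reading)}")
```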
Organization alerts
Organization alerts keep you informed about license-related issues, such as licensed core usage and the license validity period.
Alert Target | Unit | Description | Recommended Action | Default Trigger Condition |
---|---|---|---|---|
License (Core Usage) | % | Monitor the percentage of used CPU cores against the total licensed cores. | > 70%: Assess future needs and prepare to renew or upgrade the license. > 100%: Renew or upgrade the license immediately to avoid operational disruptions. | WARNING: Trigger alerts when the number of used CPU cores reaches or exceeds 70% of the total. CRITICAL: Trigger alerts when the number of used CPU cores reaches or exceeds 100% of the total. |
License (Validity Period) | Day | Track the remaining days of license validity. | < 60 days: Start preparing to renew or upgrade the license. < 0 days (expired): Renew or upgrade the license immediately to avoid restrictions such as the inability to create new clusters or scale up. | WARNING: Trigger alerts when the license validity is 60 days or less. CRITICAL: Trigger alerts when the license expires. |
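The default trigger conditions in this table can be read as two simple classifiers. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "expired" means fewer than 0 days remaining, following the Recommended Action column.

```python
from typing import Optional


def license_core_alert(used_cores: int, licensed_cores: int) -> Optional[str]:
    """Return the alert level for License (Core Usage), or None if no alert."""
    if licensed_cores <= 0:
        raise ValueError("licensed_cores must be positive")
    # Integer comparison avoids floating-point rounding at the 70% boundary.
    if used_cores >= licensed_cores:
        return "CRITICAL"  # reaches or exceeds 100% of licensed cores
    if used_cores * 100 >= 70 * licensed_cores:
        return "WARNING"  # reaches or exceeds 70% of licensed cores
    return None


def license_validity_alert(days_remaining: int) -> Optional[str]:
    """Return the alert level for License (Validity Period), or None."""
    if days_remaining < 0:
        # Assumption: a negative value means the license has expired.
        return "CRITICAL"
    if days_remaining <= 60:
        return "WARNING"  # 60 days or less of validity left
    return None
```

For example, `license_core_alert(7, 10)` returns `"WARNING"` and `license_validity_alert(90)` returns `None`.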
Project alerts
Project alerts focus on the operational aspects of your clusters, including notifications on the CU usage, QPS thresholds, latency issues, and request anomalies, ensuring you maintain optimal cluster performance.
For each project alert target, the trigger condition consists of an operator, a threshold value, and a duration, all of which must be met for the alert to be triggered. The operator can be one of the following: >, >=, <, <=, =. The threshold is a numeric value for metrics such as query latency, query QPS, search QPS, CU Capacity, and CU Computation. The duration specifies how long the metric must continuously satisfy the condition before the alert is triggered, and can be set from a minimum of 1 minute to a maximum of 30 minutes.
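To make the trigger condition concrete, the sketch below models an operator, a threshold, and a duration, and checks them against a series of readings. It is an illustration under stated assumptions, not how Zilliz Cloud evaluates alerts: the `TriggerCondition` class and `is_triggered` function are hypothetical, and the sketch assumes one metric sample per minute and that every sample in the most recent window must satisfy the condition.

```python
import operator
from dataclasses import dataclass
from typing import Sequence

# The five operators a trigger condition can use.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
}


@dataclass(frozen=True)
class TriggerCondition:
    """Hypothetical model of a project alert trigger condition."""

    op: str
    threshold: float
    duration_minutes: int

    def __post_init__(self) -> None:
        if self.op not in OPERATORS:
            raise ValueError(f"unsupported operator: {self.op!r}")
        if not 1 <= self.duration_minutes <= 30:
            raise ValueError("duration must be between 1 and 30 minutes")


def is_triggered(condition: TriggerCondition, samples: Sequence[float]) -> bool:
    """Return True if the latest `duration_minutes` samples all meet the condition.

    Assumption: `samples` holds one reading per minute, oldest first.
    """
    window = condition.duration_minutes
    if len(samples) < window:
        return False  # not enough history to cover the full duration
    compare = OPERATORS[condition.op]
    return all(compare(value, condition.threshold) for value in samples[-window:])


if __name__ == "__main__":
    # CU Capacity above 70% for 10 minutes.
    warning = TriggerCondition(op=">", threshold=70, duration_minutes=10)
    print(is_triggered(warning, [75.0] * 10))           # sustained breach
    print(is_triggered(warning, [75.0] * 9 + [60.0]))   # breach interrupted
```

Requiring every sample in the window to satisfy the condition keeps short spikes from raising alerts, which is the purpose of the duration value.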
Default alert targets
Zilliz Cloud predefines common alert targets to ensure that critical issues are quickly identified and addressed with the appropriate actions.
For more information about recommended actions, refer to Cluster metrics.
Alert Target | Unit | Default Trigger Condition |
---|---|---|
CU Computation | % | WARNING: Trigger alerts at >70% utilized computational power for 10+ minutes. CRITICAL: Trigger alerts at >90% utilized computational power for 10+ minutes. |
CU Capacity | % | WARNING: Trigger alerts at >70% utilized CU capacity for 10+ minutes. CRITICAL: Trigger alerts at >90% utilized CU capacity for 10+ minutes. |
Search (QPS) | QPS | Trigger WARNING alerts at >50 search operations per second for 10+ minutes. |
Query (QPS) | QPS | Trigger WARNING alerts at >50 query operations per second for 10+ minutes. |
Search Latency (P99) | ms | Trigger WARNING alerts at P99 latency >1,000ms for 10+ minutes. |
Query Latency (P99) | ms | Trigger WARNING alerts at P99 latency >1,000ms for 10+ minutes. |
Custom alert targets
In addition to the predefined default project alerts, you can configure custom alert targets as needed.
Alert Target | Description |
---|---|
Resource | |
Storage | Monitor storage usage and send notifications if the usage exceeds a threshold for a certain duration. |
Performance (read/write) | |
Bulk Insert (QPS) | Monitor the rate of bulk insert operations and send notifications if the rate exceeds a threshold for a certain duration. |
Delete (QPS) | Monitor the rate of delete operations and send notifications if the rate exceeds a threshold for a certain duration. |
Insert (QPS) | Monitor the rate of insert operations and send notifications if the rate exceeds a threshold for a certain duration. |
Insert (VPS) | Monitor the rate of vector insert operations and send notifications if the rate exceeds a threshold for a certain duration. |
Search (VPS) | Monitor the rate of vector search operations and send notifications if the rate exceeds a threshold for a certain duration. |
Upsert (QPS) | Monitor the rate of upsert operations and send notifications if the rate exceeds a threshold for a certain duration. |
Upsert (VPS) | Monitor the rate of vector upsert operations and send notifications if the rate exceeds a threshold for a certain duration. |
Writes to Cluster Are Disabled | Monitor the write operations to the cluster to ensure they are not prohibited. Please scale out immediately if write prohibition has been triggered. |
Performance (latency) | |
Delete Latency (Average) | Monitor the average latency for delete requests and send notifications if the latency exceeds a threshold for a certain duration. |
Delete Latency (P99) | Monitor the P99 latency for delete requests and send notifications if the latency exceeds a threshold for a certain duration. |
Insert Latency (Average) | Monitor the average latency for insert requests and send notifications if the latency exceeds a threshold for a certain duration. |
Insert Latency (P99) | Monitor the P99 latency for insert requests and send notifications if the latency exceeds a threshold for a certain duration. |
Query Latency (Average) | Monitor the average latency for query requests and send notifications if the latency exceeds a threshold for a certain duration. |
Search Request Latency (Average) | Monitor the average latency for search requests and send notifications if the latency exceeds a threshold for a certain duration. |
Upsert Latency (Average) | Monitor the average latency for upsert requests and send notifications if the latency exceeds a threshold for a certain duration. |
Upsert Latency (P99) | Monitor the P99 latency for upsert requests and send notifications if the latency exceeds a threshold for a certain duration. |
Performance (request failure rate) | |
Bulk Insert Failure Rate | Monitor the failure rate of bulk insert requests and send notifications if the rate exceeds a threshold for a certain duration. |
Delete Failure Rate | Monitor the failure rate of delete requests and send notifications if the rate exceeds a threshold for a certain duration. |
Insert Failure Rate | Monitor the failure rate of insert requests and send notifications if the rate exceeds a threshold for a certain duration. |
Query Failure Rate | Monitor the failure rate of query requests and send notifications if the rate exceeds a threshold for a certain duration. |
Search Failure Rate | Monitor the failure rate of search requests and send notifications if the rate exceeds a threshold for a certain duration. |
Slow Query Count | Monitor the number of slow queries and send notifications if the value exceeds a threshold for a certain duration. By default, all requests whose latency exceeds 5 seconds are considered slow queries. |
Upsert Failure Rate | Monitor the failure rate of upsert requests and send notifications if the rate exceeds a threshold for a certain duration. |
Data | |
Loaded Entities | Monitor the number of loaded entities and send notifications if the count exceeds a threshold for a certain duration. |
Total Collections | Monitor the number of total collections and send notifications if the count exceeds a threshold for a certain duration. |
Total Entities | Monitor the number of total entities and send notifications if the count exceeds a threshold for a certain duration. |
Others | |
Cluster Is Abnormal | Monitor the status of a cluster to ensure it is functioning properly. This includes checking the cluster load and usage. |