Data Resilience
Zilliz Cloud, as a fully managed vector database service, delivers enterprise-grade High Availability (HA) and Disaster Recovery (DR) capabilities to ensure the continuous availability of your mission-critical data and services under various failure scenarios.
Core Capabilities
- High Availability (HA): Automatic failure detection and rapid failover mechanisms ensure uninterrupted service during node-, availability zone (AZ)-, or region-level outages.
- Disaster Recovery (DR): Comprehensive backup and restore strategies enable rapid business recovery after major incidents.
- Flexible Resilience Tiers: Options range from Standard to enterprise-grade cross-region deployments, tailored to meet diverse RPO/RTO requirements across business scenarios.
- Cost Optimization: Choose the most cost-effective resilience strategy based on business value and risk tolerance.
Key Concepts
Core Metrics
- Recovery Point Objective (RPO): The maximum acceptable data loss, measured in time. For example, an RPO of 5 minutes means up to 5 minutes of recent data may be lost during a failure.
- Recovery Time Objective (RTO): The maximum permissible time from failure onset to full service restoration, including failure detection, failover decision-making, and actual recovery.
- Service Level Agreement (Uptime SLA): Zilliz Cloud's commitment to service availability, expressed as a percentage. For example, 99.95% uptime allows no more than about 21.6 minutes of downtime per month (see the quick calculation below).
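The downtime budget implied by an uptime SLA is easy to sanity-check. A minimal Python sketch, assuming a 30-day month, that reproduces the figures for the Enterprise (99.95%) and Enterprise Multi-Replica (99.99%) tiers:

```python
# Downtime budget implied by an uptime SLA, assuming a 30-day month.
def monthly_downtime_budget_minutes(uptime_sla: float) -> float:
    minutes_per_month = 30 * 24 * 60  # 43,200 minutes
    return minutes_per_month * (1 - uptime_sla)

print(f"{monthly_downtime_budget_minutes(0.9995):.1f}")  # 21.6 (Enterprise)
print(f"{monthly_downtime_budget_minutes(0.9999):.1f}")  # 4.3 (Enterprise Multi-Replica)
```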
Fault Tolerance Scope
- Node-level fault tolerance: Failure of a single compute or storage node
- AZ-level fault tolerance: Complete AZ outage (e.g., data center failure)
- Region-level fault tolerance: Entire region service disruption (e.g., natural disaster)
- Cloud provider-level fault tolerance: Multi-cloud deployment to mitigate risks from a single cloud vendor
Resilience Architecture Tiers
High Availability (HA) Tiers
| Tier | Description | RPO | RTO | Write Latency / Replication Scheme | Fault Tolerance | SLA | Relative Cost |
|---|---|---|---|---|---|---|---|
| Standard | Single-region, single-AZ deployment with a multi-replica mechanism | 0 seconds | ≤1 minute | Writes within a single AZ; WAL replicated via Quorum | Node-level failure (AZs: 1, regions: 1) | No SLA guarantee | Low |
| Enterprise | Single-region deployment across 3 AZs with automatic failover | 0 seconds | ≤1 minute | Cross-AZ writes; WAL replicated via Quorum | AZ-level failure (AZs: 3, regions: 1) | 99.95% | Medium |
| Enterprise Multi-Replica | Active-active multi-replica architecture within a region; read/write separation with fast failover | 0 seconds | ≤10 seconds | Cross-AZ writes; inter-replica sync via WAL | AZ-level failure (AZs: 3, regions: 1) | 99.99% | Medium–High |
| Cross-Region HA | Multi-region/multi-cloud deployment with global load balancing | ≤10 seconds | Manual or automatic failover (auto: ≤3 minutes) | Synchronous writes across AZs; asynchronous replication to other regions/clouds | Region-level failure (AZs: ≥3, regions: ≥2) | 99.99% | High |
Cross-Region HA will be available in November 2025, and Incremental Backup in December 2025.
Disaster Recovery (DR) Tiers
| Tier | Description | RPO | Restore Speed | Backup Strategy | Use Case | Additional Cost |
|---|---|---|---|---|---|---|
| Local Backup | Same-region object storage; scheduled full backups | Hourly | Minutes to hours | Full backups | Accidental deletion, logical error recovery | Low |
| Cross-Region Backup | Backup data stored in a different region; protects against regional disasters | Hourly | Minutes to hours | Full backups replicated across regions/clouds | Regional disaster, compliance requirements | Medium |
| Incremental Backup | Real-time incremental backups; fine-grained recovery points | ≤1 minute | Minutes to hours | Continuous capture of transaction logs | Point-in-time recovery for critical workloads | Medium–High |
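Where backups need to be automated rather than triggered from the console, they can be scripted against the control-plane API. The sketch below is illustrative only: the host, the `/v2/backups` route, the auth scheme, and the request body are assumptions rather than the documented OpenAPI surface, so consult the Zilliz Cloud API reference for the actual routes.

```python
import requests

API_KEY = "<your-zilliz-cloud-api-key>"    # assumption: bearer-token auth
BASE_URL = "https://api.cloud.zilliz.com"  # assumption: control-plane host

# Hypothetical call triggering an on-demand full backup of one cluster.
resp = requests.post(
    f"{BASE_URL}/v2/backups",                 # hypothetical route
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"clusterId": "in01-xxxxxxxxxxxx"},  # placeholder cluster ID
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```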
Quick Selection Guide
Business Tiering & Resilience Recommendations
Tier 1 – Mission-Critical Workloads
- Characteristics: 24/7 operation; even minutes of downtime cause significant loss; extremely high business value
- Recommended: Cross-Region HA + Enterprise Multi-Replica + Continuous Data Protection
- Targets: RPO = 0s, RTO < 30s, cross-cloud/region DR
- Expected Cost: High
Tier 2 – Important Business Systems
- Characteristics: 24/7 operation; high stability requirements
- Recommended: Enterprise Multi-Replica + Cross-Region Backup
- Targets: RPO = 0s, RTO < 30s
- Expected Cost: Medium–High
Tier 3 – General Applications
- Characteristics: Operates during business hours; cost-sensitive; tolerates some recovery time
- Recommended: Enterprise + Local Backup
- Targets: RPO = 0s, RTO < 3 minutes
- Expected Cost: Low–Medium
Tier 4 – Non-Critical Workloads
- Characteristics: Non-essential systems; cost-sensitive; accepts scheduled maintenance windows
- Recommended: Standard + Local Backup
- Targets: RPO = 0s, RTO < 3 minutes
- Expected Cost: Low–Medium
Cost Optimization Decision Matrix
| Business Impact | Data Value | Compliance Requirement | Recommended Solution | Cost Level |
|---|---|---|---|---|
| Extremely High | Extremely High | Strict | Cross-Region HA + Full DR | High |
| High | High | Moderate | Enterprise Multi-Replica + Cross-Region Backup | Medium–High |
| Medium | Medium | Basic | Enterprise + Local Backup | Medium |
| Low | Low | None | Standard + Basic Backup | Low |
Frequently Asked Questions (FAQ)
Q1: How do Standard and Enterprise plans achieve high availability?
Architecture Design
Zilliz Cloud uses a compute-storage disaggregated architecture that manages three types of data:
- Metadata: Stored in etcd (3 replicas, Raft protocol)
- Log data: Stored in the proprietary Woodpecker engine (Quorum protocol)
- Raw and index data: Stored in object storage, inheriting the cloud provider's native HA
Compute Node HA
- Managed by Kubernetes for automatic scheduling
- Pods automatically respawn after a single-node or single-AZ failure
- The coordinator reassigns segments to other QueryNodes
- Indexes and data are reloaded from storage, with recovery time < 1 minute (a client-side retry sketch follows this list)
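From a client's perspective, a failover of this kind appears as a brief window of failed requests rather than an outage. A defensive pattern is to retry idempotent reads with exponential backoff. In the sketch below, `MilvusClient` and `search` are real pymilvus APIs, while the endpoint URI, token, and collection name are placeholders:

```python
import time
from pymilvus import MilvusClient

client = MilvusClient(
    uri="https://in01-xxxx.api.<region>.zillizcloud.com",  # placeholder endpoint
    token="<api-key>",                                     # placeholder credential
)

def search_with_retry(vec, retries=5, base_delay=0.5):
    """Retry an idempotent read to ride out a sub-minute failover."""
    for attempt in range(retries):
        try:
            return client.search(
                collection_name="my_collection",  # placeholder collection
                data=[vec],
                limit=5,
            )
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```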
Cost Optimization
- Uses multiple persistent replicas plus dynamic in-memory loading
- Avoids the cost explosion of maintaining multiple in-memory replicas
- Simplifies the DR architecture
- Leverages log and object storage bandwidth for faster recovery
Q2: How does the multi-replica mechanism work?
Core Mechanism
- Shard level: Multiple StreamNodes load the same shard in primary/standby roles
- Segment level: Multiple QueryNodes load the same segment; data persists as a single copy
Read/Write Separation
- Writes: Handled by the primary StreamNode
- Reads: Served by any standby StreamNode or QueryNode
Key Benefits
- Fast failover: The proxy automatically redirects traffic to standby nodes
- Higher QPS: Multiple in-memory replicas improve read throughput (a load example follows this list)
- Smooth upgrades: Rolling updates reduce service jitter and improve stability
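In pymilvus, the number of in-memory replicas is chosen when a collection is loaded. A minimal sketch: `load(replica_number=...)` and `get_replicas()` are real pymilvus APIs, while the endpoint, token, and collection name are placeholders, and the replica count you can request depends on your plan.

```python
from pymilvus import connections, Collection

connections.connect(
    uri="https://in01-xxxx.api.<region>.zillizcloud.com",  # placeholder endpoint
    token="<api-key>",                                     # placeholder credential
)

collection = Collection("my_collection")  # placeholder collection
collection.load(replica_number=2)         # two in-memory replicas for higher read QPS

# Inspect how replica groups were assigned to query nodes.
for group in collection.get_replicas().groups:
    print(group)
```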
Q3: How does Global Database enable cross-region high availability?
CDC Synchronization
- Change Data Capture (CDC) synchronizes DDL, DML, and bulk import operations
- Typical sync latency is < 10 seconds
- Enables cross-region/cross-cloud DR with a very low RPO
Data Write Strategy
- Data is written synchronously across multiple AZs within the same region
- Write latency stays at the inter-AZ level
- In extreme failover scenarios, data loss is < 10 seconds
Roadmap for 2026: achieve RPO = 0 with cross-region Woodpecker.
Failover Modes
- Manual: Triggered via the OpenAPI or the Web Console (an illustrative sketch follows this list)
- Automatic: The Zilliz health-check service detects the failure and completes failover within 1–3 minutes
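Scripting the manual path might look like the following sketch. Only the existence of an OpenAPI-triggered failover is stated above; the host, route, authentication scheme, and payload here are all assumptions for illustration:

```python
import requests

API_KEY = "<your-zilliz-cloud-api-key>"    # assumption: bearer-token auth
BASE_URL = "https://api.cloud.zilliz.com"  # assumption: control-plane host

# Hypothetical request promoting the standby region of a Global Database.
resp = requests.post(
    f"{BASE_URL}/v2/globalDatabases/failover",  # hypothetical route
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"clusterId": "in01-xxxxxxxxxxxx"},    # placeholder cluster ID
    timeout=30,
)
resp.raise_for_status()
```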
Access Patterns
| Mode | Characteristics | Use Case |
|---|---|---|
| Active-Standby DR | Primary handles reads/writes; standby is activated only during failover | Standard disaster recovery |
| Active-Active (Multi-Read) | Primary handles writes; multiple regions serve reads (nearest-region reads) | Global read-heavy, low-write workloads |
| Multi-Primary (Coming in 2026) | Both regions accept writes; the user must avoid data conflicts | Cell-based or sharded deployments |
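In the Active-Active (Multi-Read) pattern, an application typically pins one client to the primary region for writes and points a second client at its nearest regional endpoint for reads. A minimal pymilvus sketch; the region names, endpoints, token, and collection name are placeholders:

```python
from pymilvus import MilvusClient

# Placeholder regional endpoints for one Global Database deployment.
ENDPOINTS = {
    "us-west-2": "https://in01-xxxx.api.us-west-2.zillizcloud.com",
    "eu-central-1": "https://in01-xxxx.api.eu-central-1.zillizcloud.com",
}
PRIMARY_REGION = "us-west-2"   # accepts writes
LOCAL_REGION = "eu-central-1"  # nearest region to this application instance

write_client = MilvusClient(uri=ENDPOINTS[PRIMARY_REGION], token="<api-key>")
read_client = MilvusClient(uri=ENDPOINTS[LOCAL_REGION], token="<api-key>")

# Writes go to the primary; reads are served from the nearest region.
write_client.insert(
    collection_name="my_collection",
    data=[{"id": 1, "vector": [0.1] * 8}],
)
results = read_client.search(collection_name="my_collection", data=[[0.1] * 8], limit=5)
```

Keep in mind that nearest-region reads can trail the primary by the CDC sync latency, typically under 10 seconds.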
For the latest feature updates or technical support, please contact Zilliz Cloud support.