The Making of Kloudfuse 3.5: Designing Multi-Zone High Availability and Disaster Recovery

Two approaches to availability: automatic failover for zero downtime, or cost-optimized manual recovery

Observability platforms monitor production infrastructure. But what happens when the observability platform itself goes down?

Dashboards go blank. Ingestion stops. Alerts stop firing. You're troubleshooting blind during the exact moment you need visibility most.

We built two availability options in Kloudfuse 3.5 because different organizations have different requirements: some need automatic failover with zero downtime, others need rapid recovery capability without doubling infrastructure costs.

The Availability Problem

Single-zone deployments provide no redundancy. A hardware failure, network issue, or cloud provider incident in that zone takes down your entire observability platform.

Traditional solutions involve manual failover procedures or active-passive configurations where standby clusters wait idle until primary systems fail. These work, but they require human intervention during incidents—the worst possible time to execute complex procedures.

Some vendors offer active-active multi-zone deployments. Others provide disaster recovery configurations. Few offer both with the flexibility to choose based on actual requirements rather than vendor-imposed defaults.

The challenge: balancing availability guarantees against infrastructure costs and operational complexity.

Two Approaches to Availability

Kloudfuse 3.5 supports two distinct availability models:

Multi-Zone High Availability: Automatic failover with zero downtime across three availability zones. No manual intervention. Requires running active infrastructure across multiple zones.

Disaster Recovery: Manual failover across regions, with rapid recovery through backup and restore once triggered. Lower resource requirements.

Organizations choose the model, or combination of models, that matches their availability requirements and cost constraints.

Multi-Zone High Availability

Zero Downtime, Automatic Failover

Multi-zone HA runs Kloudfuse across three availability zones simultaneously. Each zone contains complete infrastructure: ingestion pipelines, storage nodes, query engines, and control plane components.

When one zone fails, the platform continues operating automatically. No human intervention. No manual failover procedures. No coordination with application teams.

Distributed Architecture

The architecture distributes workloads across zones: pod anti-affinity rules spread replicas across zones, and replication keeps data consistent between them.

Ingestion endpoints exist in all zones with automatic load balancing. When one zone becomes unavailable, traffic routes to healthy zones without application-level changes. Services sending telemetry continue writing to available ingestion endpoints without experiencing disruption.

Storage replication distributes data across zones with replicas in each zone. Critical services default to three replicas, one per zone. Single-zone failures don't cause data loss because replicas remain accessible in healthy zones.
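
As a rough illustration of the one-replica-per-zone idea, here is a minimal sketch. The zone names and both helper functions are hypothetical and are not part of Kloudfuse; in a real cluster this placement would come from the scheduler's anti-affinity rules rather than application code.

```python
def place_replicas(zones, replicas=3):
    """Assign replicas round-robin so no zone gets a second replica
    before every zone has one."""
    if not zones:
        raise ValueError("at least one zone is required")
    return [zones[i % len(zones)] for i in range(replicas)]


def survives_zone_loss(placement, failed_zone):
    """True if at least one replica lives outside the failed zone."""
    return any(zone != failed_zone for zone in placement)


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
placement = place_replicas(zones)

# With one replica per zone, losing any single zone leaves two replicas.
all_survive = all(survives_zone_loss(placement, z) for z in zones)
```

With three replicas and three zones, every single-zone failure leaves a readable copy. With a single zone, the same check fails, which is the single-zone risk described earlier.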

Query operations span zones transparently. Query engines fetch data from whichever zones hold it, aggregating results regardless of zone topology. If a zone fails mid-query, the engine retrieves data from replicas in healthy zones automatically.
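
The replica fallback described above might look roughly like this. It is a simplified sketch, not the actual query engine: the shard layout, the `fetch_shard` and `run_query` helpers, and the use of a sum as the aggregation are all assumptions made for illustration.

```python
class ZoneUnavailable(Exception):
    """Raised when no healthy zone holds a replica of a shard."""


def fetch_shard(shard, replicas_by_zone, healthy_zones):
    """Return shard data from the first healthy zone that holds a replica."""
    for zone, store in replicas_by_zone.items():
        if zone in healthy_zones and shard in store:
            return store[shard]
    raise ZoneUnavailable(f"no healthy replica for shard {shard!r}")


def run_query(shards, replicas_by_zone, healthy_zones):
    """Aggregate (here: sum) per-shard values, regardless of serving zone."""
    return sum(fetch_shard(s, replicas_by_zone, healthy_zones) for s in shards)


# Hypothetical data: every zone holds a replica of both shards.
replicas = {
    "zone-a": {"s1": 10, "s2": 5},
    "zone-b": {"s1": 10, "s2": 5},
    "zone-c": {"s1": 10, "s2": 5},
}

all_up = run_query(["s1", "s2"], replicas, {"zone-a", "zone-b", "zone-c"})
zone_a_down = run_query(["s1", "s2"], replicas, {"zone-b", "zone-c"})
```

The caller gets the same answer whether or not zone-a is reachable, because the lookup skips unhealthy zones instead of failing.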

Automatic Failover Implementation

The critical design requirement: eliminate manual intervention during zone failures.

Distributed consensus maintains data consistency. Storage systems use leader election—when a leader becomes unreachable, remaining nodes elect a new leader within seconds and continue operations.

Health checking ensures availability. Load balancers continuously monitor endpoints, removing failed zones from rotation immediately. Ingestion pipelines use circuit breakers to detect unhealthy components and route traffic to healthy alternatives automatically.
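
To make the circuit breaker pattern concrete, here is a minimal sketch. The class, the threshold of three failures, and the 30-second cool-down are assumptions chosen for the example, not values taken from Kloudfuse.

```python
import time


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit traffic again once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def pick_endpoint(breakers):
    """Return the first endpoint whose breaker allows traffic, or None."""
    for endpoint, breaker in breakers.items():
        if breaker.allow():
            return endpoint
    return None


now = [0.0]  # fake clock so the demo is deterministic
breakers = {z: CircuitBreaker(clock=lambda: now[0]) for z in ("zone-a", "zone-b")}
for _ in range(3):
    breakers["zone-a"].record_failure()
chosen = pick_endpoint(breakers)  # zone-a is open, so traffic goes elsewhere
```

After three failures the breaker for zone-a opens and traffic moves to zone-b. Once the cool-down passes, zone-a is tried again, and a success closes the breaker.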

Kubernetes orchestration handles pod failures. When control plane components in a failed zone become unavailable, the orchestration layer automatically reschedules pods to healthy nodes in other zones.

This automation matters because incidents are high-stress situations. Engineers shouldn't execute failover procedures when they should be investigating root causes.

Zero-Downtime Upgrades

Multi-zone architecture enables zero-downtime upgrades. Because replicas distribute across zones, the platform upgrades one zone at a time while others continue serving traffic. Ingestion and queries remain available throughout the upgrade process.
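
A zone-by-zone rollout can be sketched as a simple loop. The `upgrade_zone` and `is_healthy` callbacks are hypothetical placeholders for whatever the deployment tooling provides, and halting on the first unhealthy zone is an assumed policy.

```python
def rolling_upgrade(zones, upgrade_zone, is_healthy):
    """Upgrade one zone at a time; halt if an upgraded zone is unhealthy."""
    upgraded = []
    for zone in zones:
        # Only this zone is being changed; the others keep serving traffic.
        upgrade_zone(zone)
        if not is_healthy(zone):
            raise RuntimeError(f"{zone} unhealthy after upgrade; halting rollout")
        upgraded.append(zone)
    return upgraded


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
versions = {zone: "old" for zone in zones}


def upgrade(zone):
    versions[zone] = "new"


result = rolling_upgrade(zones, upgrade, lambda zone: True)
```

Stopping at the first failed health check means at most one zone is in a bad state, and the remaining zones still run the previous version.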

No maintenance windows. No scheduled downtime. No coordination about when monitoring will be unavailable.

Resource Requirements

Multi-zone HA requires running infrastructure across three availability zones. This architectural choice stems from consensus system requirements—three zones allow the system to maintain quorum and prevent split-brain scenarios during single-zone failures.
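
The quorum arithmetic behind the three-zone requirement is short enough to check directly. This sketch assumes a simple majority quorum with one voting member per zone; the function names are illustrative.

```python
def quorum(members):
    """Smallest majority of a group: floor(members / 2) + 1."""
    return members // 2 + 1


def has_quorum(total_zones, failed_zones):
    """True if the surviving zones still form a majority."""
    return total_zones - failed_zones >= quorum(total_zones)


three_zone_ok = has_quorum(total_zones=3, failed_zones=1)  # 2 survivors, need 2
two_zone_ok = has_quorum(total_zones=2, failed_zones=1)    # 1 survivor, need 2
```

With two zones, losing either one leaves a single survivor that cannot form a majority, so it cannot safely tell a failed peer from a network partition. Three is the smallest zone count that tolerates one failure.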

Each zone must contain equivalent infrastructure. Critical services run three replicas—one per zone—approximately tripling base capacity requirements. Cross-zone replication consumes network bandwidth.

This is the trade-off for automatic failover and zero downtime.

Disaster Recovery Configuration

Cost-Optimized Manual Failover

For organizations that need recovery capability without the cost of running full multi-zone infrastructure, Kloudfuse 3.5 supports disaster recovery configurations through backup and restore.

DR maintains a standby environment that activates when triggered. Unlike multi-zone HA, where all zones run actively, DR uses a cold or warm standby approach with significantly lower resource requirements.

Architecture

The primary region runs a full Kloudfuse deployment that handles all production traffic.

The DR region maintains access to backed-up data in cloud object storage. A secondary cluster can be created in a failover region when needed.

Data is persisted on two schedules. Observability data (metrics, logs, traces) uploads continuously to cloud storage, while configuration and metadata are backed up periodically to the same storage location.

When the primary region experiences an outage, platform teams manually trigger restore to activate the DR region.

Manual Failover Design

DR requires manual triggering. This design is intentional—it gives platform teams control over failover timing and allows assessment of whether primary region issues are transient or require full failover.

When failover is triggered:

Restore process activates in DR region
Data rehydrates from cloud storage
Configuration and metadata restore from backups
Ingestion endpoints become active
Query engines begin serving traffic
DNS configuration updates to point to DR region
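
The sequence above can be sketched as an ordered runbook. The step names mirror the list, but running them strictly one after another, the `execute` callback, and stopping at the first failure are assumptions for illustration, not a description of the actual restore tooling.

```python
FAILOVER_STEPS = [
    "activate restore process in DR region",
    "rehydrate data from cloud storage",
    "restore configuration and metadata from backups",
    "activate ingestion endpoints",
    "start query engines",
    "update DNS to point to DR region",
]


def run_failover(steps, execute):
    """Run steps in order. Returns (completed_steps, failed_step_or_None)."""
    completed = []
    for step in steps:
        if not execute(step):
            # Stop here so later steps never run against a partial restore.
            return completed, step
        completed.append(step)
    return completed, None


# Hypothetical executor that succeeds on every step.
completed, failed = run_failover(FAILOVER_STEPS, lambda step: True)
```

Returning both the completed steps and the failing step gives the operator a clear point to resume from, which matters when the procedure runs during an outage.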

The manual trigger provides control. The activation process provides rapid recovery.

Recovery Time Objectives

DR recovery time depends on several factors:

Time to assess primary region status and decide to fail over
Time to trigger restore process
Data rehydration time (depends on data volume)
Configuration restoration time
DNS propagation delay

Platform components become active once the restore process completes, so their activation time tracks the restore duration. External dependencies such as DNS propagation typically contribute significantly to total recovery time.

Recovery Point Objective (how much data loss is acceptable) depends on backup frequency. Because observability data is backed up continuously, its potential data loss is minimal. Configuration backups occur periodically, so the RPO for configuration depends on that backup interval.
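
A back-of-the-envelope estimate can combine these factors. The durations below are invented placeholders, not measured Kloudfuse numbers. The sketch also assumes the phases run one after another, so the sum is an upper bound if some of them overlap in practice.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    """Per-phase durations in minutes (all values are hypothetical inputs)."""
    assess_min: float
    trigger_min: float
    rehydrate_min: float
    config_restore_min: float
    dns_propagation_min: float

    def rto_minutes(self):
        # Serial sum of every phase from the list above.
        return (self.assess_min + self.trigger_min + self.rehydrate_min
                + self.config_restore_min + self.dns_propagation_min)


def worst_case_rpo_minutes(backup_interval_min):
    """A failure just before the next backup loses up to one full interval."""
    return backup_interval_min


estimate = RecoveryEstimate(
    assess_min=10, trigger_min=2, rehydrate_min=20,
    config_restore_min=5, dns_propagation_min=5,
)
total_rto = estimate.rto_minutes()
config_rpo = worst_case_rpo_minutes(backup_interval_min=60)
```

Plugging in your own measurements shows which phase dominates. In many setups that is the human decision time or DNS propagation rather than the restore itself.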

Resource Efficiency

DR configurations reduce costs compared to multi-zone HA:

Standby capacity doesn't require full compute resources until failover is triggered. The secondary cluster primarily accesses data in cloud storage rather than running active ingestion and query workloads.

Storage costs leverage cloud object storage pricing, which is significantly cheaper than active compute infrastructure.

Network costs for periodic backups are lower than continuous active-active cross-zone replication.

Organizations maintain recovery capability while reducing infrastructure costs substantially.

Choosing Your Availability Model

The decision depends on availability requirements, cost constraints, and operational preferences.

Multi-Zone HA When:

Zero downtime is required
Automatic failover is mandatory
Zone-level failures must be transparent to users
Budget supports running active infrastructure across three zones
Zero-downtime upgrades are needed
Consensus-based systems require quorum for fault tolerance

Use case: Production observability for services with strict SLAs where any downtime impacts business operations.

Disaster Recovery When:

Manual failover is acceptable
Cost optimization is a priority
Regional-level redundancy is needed
Recovery measured in minutes is sufficient
Compliance requires cross-region backup capability
Brief data loss (based on backup frequency) is acceptable

Use case: Environments where cost efficiency matters and recovery windows measured in minutes are operationally acceptable.

Combined Deployment:

Some organizations deploy multi-zone HA in the primary region for automatic zone failover, with DR backups to a secondary region for regional disaster protection. The restore process handles differences between primary and DR configurations. This provides both zero-downtime handling of zone failures and cross-region disaster recovery capability.

Configuration Flexibility

Kloudfuse 3.5 supports flexible availability configurations:

Single-zone: Standard deployment for development environments and non-critical workloads
Three-zone HA: Multi-availability-zone deployment for automatic failover and maximum zone-level resilience
DR (Warm Standby): Periodic backups with quick manual recovery and reduced costs
DR (Cold Standby): Lowest cost approach with backup-only strategy and longer recovery time
Three-zone HA + DR: Automatic failover within region, manual failover across regions

You choose the configuration based on your specific requirements. The platform scales from single-zone simplicity to multi-region disaster recovery.

This flexibility matters because availability comes with costs. Different environments warrant different trade-offs. Production deployments might use three-zone HA for zero downtime. Staging environments might use single-zone with DR backup. Development environments might run single-zone without backup.
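
One way to encode this decision is a small helper that maps requirements to the configurations listed above. The function and its three flags are hypothetical and simplify the criteria considerably; a real decision would weigh more factors.

```python
def choose_configuration(needs_zero_downtime, needs_regional_redundancy,
                         cost_sensitive=False):
    """Map coarse requirements to one of the listed configurations."""
    if needs_zero_downtime and needs_regional_redundancy:
        return "Three-zone HA + DR"
    if needs_zero_downtime:
        return "Three-zone HA"
    if needs_regional_redundancy:
        # Cold standby trades a longer recovery time for the lowest cost.
        return "DR (Cold Standby)" if cost_sensitive else "DR (Warm Standby)"
    return "Single-zone"


production = choose_configuration(needs_zero_downtime=True,
                                  needs_regional_redundancy=False)
staging = choose_configuration(needs_zero_downtime=False,
                               needs_regional_redundancy=True)
development = choose_configuration(needs_zero_downtime=False,
                                   needs_regional_redundancy=False)
```

Writing the choice down per environment, even in a table rather than code, makes the cost and availability trade-off explicit and reviewable.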

Why We Built Both

Observability infrastructure deserves the same reliability guarantees as the systems it monitors.

But reliability doesn't mean one-size-fits-all. Organizations have different availability requirements. Development environments don't need the same guarantees as production. Startups optimize differently than enterprises. Regional compliance requirements vary.

Multi-zone HA provides automatic failover for zero-downtime requirements. DR provides cost-optimized recovery for environments where manual failover measured in minutes is acceptable.

The platform supports both because availability architecture should match organizational requirements, not vendor limitations.

Learn more about platform engineering capabilities in Kloudfuse 3.5 in our launch announcement.