
Practical Experience in Validating KubeBlocks Addon Availability with Chaos Mesh

I. Introduction

Background

In cloud-native environments, managing high availability for complex database clusters is a core challenge for enterprises. KubeBlocks, an open-source multi-engine database management platform, is dedicated to ensuring the stability of database services in various failure scenarios. However, traditional testing methods struggle to simulate complex failures in real production environments, making it difficult to fully validate system resilience.

Solution

Chaos engineering actively injects controlled failures to expose system weaknesses early and drive hardening. Through systematic testing with Chaos Mesh, we validated KubeBlocks' high availability in scenarios such as host anomalies, process anomalies, network anomalies, resource pressure anomalies, and system service anomalies: primary failover completes within seconds with zero data loss. Along the way, we also distilled a series of best practices.

This article will introduce how to leverage the Chaos Mesh tool to validate and enhance KubeBlocks' high availability capabilities through fault injection exercises.

II. Introduction to Chaos Mesh

Chaos Mesh is an open-source chaos engineering platform used for chaos testing of distributed systems in Kubernetes environments. By simulating faults and anomalies (such as network latency, service failures, resource exhaustion, etc.), Chaos Mesh helps developers and operations personnel validate system stability, fault tolerance, and high availability.

Core Architecture

Figure: Chaos Mesh architecture. Chaos Mesh consists of the Chaos Dashboard (a web UI for managing and observing experiments), the Chaos Controller Manager (which schedules and manages chaos experiments defined as CRDs), and the Chaos Daemon (a privileged DaemonSet on each node that performs the actual interference with the network, processes, and file system of target Pods).

Advantages

Chaos Mesh aligns with KubeBlocks' database management scenarios due to its native K8s integration, fine-grained fault control capabilities, and declarative experiment management.

  • K8s Native Integration: Implements fault injection based on CRD, directly operating on database Pods managed by KubeBlocks without requiring additional adaptation.
  • Fine-grained Fault Control Capabilities: Directly injects faults into K8s resources (e.g., simulating node crashes, network isolation, resource overload), allowing validation of KubeBlocks database clusters' failover capabilities, data synchronization mechanisms, and split-brain protection.
  • Declarative Experiment Management: Defines fault parameters (e.g., a 100% packet-loss rate or 100% CPU load) in YAML, matching KubeBlocks' declarative database management model. Experiments are repeatable and easy to run at high frequency, can be fully automated, and integrate with CI/CD pipelines (see the sketch after this list).
  • Security Isolation: Uses cgroups to limit the blast radius of resource faults, so that CPU/memory pressure affects only the target containers, the other nodes of the KubeBlocks database cluster remain undisturbed during testing, and faults cannot contaminate the production environment.
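
For example, the CPU-overload scenario that appears in the test matrix below can be expressed as a short manifest. The following is a minimal sketch, assuming a hypothetical experiment name and reusing the role label from the test environment described later in this article; the concrete values would be tuned per exercise.

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: test-primary-cpu-stress      # hypothetical experiment name
  namespace: default
spec:
  mode: one                          # pick one Pod matching the selector
  selector:
    namespaces:
      - kubeblocks-cloud-ns
    labelSelectors:
      kubeblocks.io/role: primary    # target the current primary
  stressors:
    cpu:
      workers: 4                     # number of CPU stress workers
      load: 100                      # 100% load per worker
  duration: "2m"                     # fault is automatically lifted afterwards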

Through these declarative experiments, KubeBlocks' core high availability capabilities were validated: primary failover within seconds and zero data loss. The results also drove continuous optimization of the multi-engine architecture and provide a quantifiable, reproducible fault-testing baseline for cloud-native database resilience.

III. KubeBlocks Engine High Availability Testing

Test Objectives

  • Validate the self-healing capabilities of various database engines managed by KubeBlocks in real fault scenarios.
  • Evaluate the effectiveness of cluster data consistency assurance mechanisms.
  • Detect the timeliness and accuracy of monitoring and alerting systems.

Test Scenarios

| Fault Type | Simulated Scenario | Expected Behavior | Validation Goal |
| --- | --- | --- | --- |
| PodChaos | Primary node Pod forced deletion | Secondary node quickly promotes to new primary; application connection briefly interrupted, then restored | Primary node election, failover time |
| PodChaos | Single replica Pod continuous restarts | Service availability unaffected; replica set automatically recovers | Effectiveness of replica redundancy |
| NetworkChaos | Primary node network latency (1000ms+) | Triggers primary node disconnection; cluster elects new primary | Network partition tolerance, split-brain protection |
| NetworkChaos | 100% packet loss between primary and secondary nodes | Primary-secondary replication delay increases; eventual consistency ensured | Robustness of asynchronous replication |
| NetworkChaos | Network partition between nodes | Majority partition continues serving; minority partition becomes unwritable | Partition tolerance (PACELC) |
| StressChaos | Primary node CPU overload (100%) | Primary node response slows down; may trigger liveness probe timeout leading to failover | Resource isolation, overload protection, probe sensitivity |
| StressChaos | Secondary node memory pressure (OOM simulation) | Secondary process crashes; K8s automatically restarts the replica | Resource isolation, process recovery capability |
| DNSChaos | Random internal DNS resolution failures within the cluster | Inter-replica communication occasionally fails; relies on retry mechanism for recovery | Service discovery reliability, client retries |
| TimeChaos | Primary node clock jumps forward 2 hours | May cause Raft term confusion or expired transactions, triggering primary node eviction | Clock drift sensitivity, logical clock assurance |
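
To make the mapping from scenario to experiment concrete, the following is a minimal sketch of the NetworkChaos latency scenario from the table above, assuming a hypothetical experiment name and the selectors of the test environment used in the next section.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: test-primary-network-delay   # hypothetical experiment name
  namespace: default
spec:
  action: delay                      # inject network latency
  mode: one
  selector:
    namespaces:
      - kubeblocks-cloud-ns
    labelSelectors:
      kubeblocks.io/role: primary    # target the current primary
  delay:
    latency: "1000ms"                # matches the 1000ms+ scenario above
    jitter: "200ms"
  duration: "5m"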

Test Execution

Taking the primary node Pod forced deletion scenario as an example:

  1. Environment Deployment: KubeBlocks deploys the target database cluster (e.g., MySQL Cluster).

  2. Fault Definition: Write the chaos-experiment.yaml for the corresponding fault scenario:
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: test-primary-pod-kill
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - kubeblocks-cloud-ns
    labelSelectors:
      app.kubernetes.io/instance: mysql-875777cc4
      kubeblocks.io/role: primary
  3. Inject Fault: Apply the experiment with kubectl apply -f chaos-experiment.yaml, then confirm that the fault was injected:
kubectl describe PodChaos test-primary-pod-kill

Status:
  Experiment:
    Container Records:
      Events:
        Operation:      Apply
        Timestamp:      2025-07-18T07:55:17Z
        Type:           Succeeded
      Id:               kubeblocks-cloud-ns/mysql-875777cc4-mysql-0
      Injected Count:   1
      Phase:            Injected
      Recovered Count:  0
      Selector Key:     .
    Desired Phase:      Run
Events:
  Type    Reason           Age   From            Message
  ----    ------           ----  ----            -------
  Normal  FinalizerInited  15m   initFinalizers  Finalizer has been inited
  Normal  Updated          15m   initFinalizers  Successfully update finalizer of resource
  Normal  Updated          15m   desiredphase    Successfully update desiredPhase of resource
  Normal  Applied          15m   records         Successfully apply chaos for kubeblocks-cloud-ns/mysql-875777cc4-mysql-0
  Normal  Updated          15m   records         Successfully update records of resource
  4. Cluster Monitoring: Monitoring shows that the secondary node was quickly promoted to primary, the failed node recovered automatically, and the service returned to normal.

  5. Result Analysis: Primary failover completed within seconds: after the primary Pod was deleted, the secondary node was promoted to primary in about 2 seconds, as the lorry logs of the new primary show.

kubectl logs -n kubeblocks-cloud-ns mysql-875777cc4-mysql-1 lorry
2025-07-18T07:55:17Z        INFO        DCS-K8S        pod selector: app.kubernetes.io/instance=mysql-875777cc4,app.kubernetes.io/managed-by=kubeblocks,apps.kubeblocks.io/component-name=mysql
2025-07-18T07:55:18Z        INFO        DCS-K8S        podlist: 2
2025-07-18T07:55:18Z        INFO        DCS-K8S        members count: 2
2025-07-18T07:55:18Z        DEBUG        checkrole        check member        {"member": "mysql-875777cc4-mysql-0", "role": ""}
2025-07-18T07:55:18Z        DEBUG        checkrole        check member        {"member": "mysql-875777cc4-mysql-1", "role": "secondary"}
2025-07-18T07:55:18Z        INFO        event        send event: map[event:Success operation:checkRole originalRole:secondary role:{"term":"1752825318001682","PodRoleNamePairs":[{"podName":"mysql-875777cc4-mysql-1","roleName":"primary","podUid":"ccf5126a-4784-4841-b238-4bf30f98b172"}]}]
2025-07-18T07:55:18Z        INFO        event        send event success        {"message": "{\"event\":\"Success\",\"operation\":\"checkRole\",\"originalRole\":\"secondary\",\"role\":\"{\\\"term\\\":\\\"1752825318001682\\\",\\\"PodRoleNamePairs\\\":[{\\\"podName\\\":\\\"mysql-875777cc4-mysql-1\\\",\\\"roleName\\\":\\\"primary\\\",\\\"podUid\\\":\\\"ccf5126a-4784-4841-b238-4bf30f98b172\\\"}]}\"}"}
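
After result analysis, the experiment should be cleaned up so the cluster returns to a clean baseline before the next scenario. A typical sequence (the pause annotation is Chaos Mesh's standard pause mechanism; the experiment name is the one defined above):

# remove the experiment once analysis is done
kubectl delete -f chaos-experiment.yaml

# alternatively, pause it temporarily instead of deleting it
kubectl annotate podchaos test-primary-pod-kill experiment.chaos-mesh.org/pause=true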

Test Results

Chaos Mesh fault injection tests were run against the various database engines managed by KubeBlocks, including MySQL, PostgreSQL, Redis, MongoDB, and SQLServer. The results validate KubeBlocks' effectiveness in ensuring database high availability.

| Test Scenario | Test Metric | Test Result |
| --- | --- | --- |
| PodChaos - Primary Pod Forced Deletion | Failover Time | MySQL/PostgreSQL/Redis/MongoDB ≤ 10 seconds; SQLServer Always On ≤ 30 seconds |
| | Service Recovery | New primary node automatically takes over; application connection interruption ≤ 2 seconds |
| | Data Consistency | Zero data loss (ensured by WAL/Raft and other log synchronization mechanisms) |
| PodChaos - Single Replica Pod Continuous Restart | Service Availability | ≥ 99.9% (requests automatically routed to healthy nodes during replica reconstruction) |
| | Replica Recovery Time | K8s restarts the Pod within 30 seconds; data synchronization delay ≤ 5 seconds |
| NetworkChaos - Primary Node Network Latency | Failover Trigger | MySQL Raft Group engine liveness probe timeout (default 15 seconds) triggers automatic primary election |
| | Split-Brain Protection | Raft consensus protocol prevents dual primaries; only the majority partition can write |
| | Performance Impact | Request latency peak ≤ 35%; returns to normal after switchover |
| NetworkChaos - 100% Packet Loss Between Primary and Secondary Nodes | Data Synchronization | No data loss while asynchronous replication is interrupted; replicas automatically catch up after recovery |
| NetworkChaos - Network Partition Between Nodes | Partition Tolerance | Majority partition remains available; minority partition rejects writes |
| StressChaos - Primary Node CPU Overload (100%) | Failover Trigger | Redis Sentinel primary under sustained CPU overload for 2 minutes triggers failover; new primary takes over; old primary automatically rejoins as a replica after recovery |
| | Resource Isolation | Secondary node performance unaffected (K8s cgroup isolation effective) |
| StressChaos - Secondary Node Memory Pressure (OOM Simulation) | Process Recovery | K8s automatically restarts the Pod within 60 seconds; service self-heals |
| | Data Synchronization | Primary and secondary fully resynchronize after restart; no state leakage |
| DNSChaos - Random Internal DNS Resolution Failures within Cluster | Service Discovery | Client retry mechanism keeps request success rate ≥ 99.9% |
| TimeChaos - Primary Node Clock Jumps Forward 2 Hours | Transaction Integrity | Committed transactions are not rolled back; cluster state consistent after clock calibration |
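
For reference, the clock-drift scenario in the last row can be reproduced with a TimeChaos manifest along the following lines; this is a sketch with a hypothetical name, and the offset mirrors the 2-hour jump tested above.

apiVersion: chaos-mesh.org/v1alpha1
kind: TimeChaos
metadata:
  name: test-primary-clock-skew      # hypothetical experiment name
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - kubeblocks-cloud-ns
    labelSelectors:
      kubeblocks.io/role: primary    # target the current primary
  timeOffset: "2h"                   # shift the Pod's clock forward by 2 hours
  duration: "10m"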

IV. Summary and Outlook

Through deep integration with Chaos Mesh and sustained practice, KubeBlocks has established an initial standardized validation system for database high availability, successfully covering core fault scenarios such as Pod failures, network failures, resource pressure, time failures, and DNS failures. Ensuring continuous high availability of database services is, however, a never-ending journey. Going forward, we plan to explore and practice in the following directions to continuously strengthen KubeBlocks' availability assurance capabilities:

Short-term Plan

  • Scenario Refinement: Introduce mixed fault injection (e.g., network latency + node restart) to get closer to the complex, cascading fault patterns of real production environments (see the sketch after this list).
  • Complexity Enhancement: Simulate regional failures (e.g., availability zone-level network isolation) to validate KubeBlocks' cross-domain disaster recovery capabilities in multi-AZ/Region deployment architectures.
  • Coverage: Extend chaos engineering practices to KubeBlocks' own control plane components to ensure the robustness of the platform itself.
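
For the mixed-fault direction, Chaos Mesh's Workflow CRD can compose several faults into a single exercise. The sketch below combines network latency with a primary Pod kill; the names and selectors are hypothetical and would need to be adapted to the target cluster.

apiVersion: chaos-mesh.org/v1alpha1
kind: Workflow
metadata:
  name: mixed-latency-and-kill       # hypothetical workflow name
  namespace: default
spec:
  entry: entry
  templates:
    - name: entry
      templateType: Parallel         # run both faults at the same time
      deadline: 10m
      children:
        - primary-network-delay
        - primary-pod-kill
    - name: primary-network-delay
      templateType: NetworkChaos
      deadline: 5m
      networkChaos:
        action: delay
        mode: one
        selector:
          namespaces:
            - kubeblocks-cloud-ns
          labelSelectors:
            kubeblocks.io/role: primary
        delay:
          latency: "500ms"
    - name: primary-pod-kill
      templateType: PodChaos
      deadline: 1m
      podChaos:
        action: pod-kill
        mode: one
        selector:
          namespaces:
            - kubeblocks-cloud-ns
          labelSelectors:
            kubeblocks.io/role: primary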

Long-term Goals

  • Ecosystem Expansion: Explore deeper integration with broader cloud-native observability, alerting, and self-healing toolchains (e.g., Prometheus, AlertManager, Argo Rollouts) to build a closed-loop resilience assurance system.
  • Intelligent Evolution: Explore AIOps-based intelligent fault prediction and exercise orchestration, automatically generating and executing the most valuable chaos experiments based on historical monitoring data, topology relationships, and risk models.

Best Practices

  • Identify system weaknesses through progressive chaos exercises (from single-point failures to mixed scenarios); quantify resilience with the three golden monitoring metrics (SLA/RTO/RPO); and integrate critical fault tests into the CI/CD pipeline for routine validation (see the sketch after this list).
  • Adjust deployment topology and parameters based on test data, link with monitoring systems to achieve minute-level fault perception, regularly validate the effectiveness of recovery plans, and drive continuous architectural optimization through cross-team exercises, ultimately building a high-availability closed loop of "fault exposure - plan execution - data-driven optimization."
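
As a sketch of what pipeline integration might look like (reusing the manifest and labels from the earlier exercise; the recovery window and assertion are assumptions to be replaced with the team's own RTO budget and checks), a CI job can inject a critical fault, wait, and assert that a primary exists again before tearing the experiment down:

# inject the critical fault exercised earlier in this article
kubectl apply -f chaos-experiment.yaml

# wait for the agreed recovery window (RTO budget)
sleep 60

# fail the pipeline if no Running Pod currently holds the primary role
kubectl get pods -n kubeblocks-cloud-ns \
  -l app.kubernetes.io/instance=mysql-875777cc4,kubeblocks.io/role=primary \
  --no-headers | grep -q Running

# clean up the experiment
kubectl delete -f chaos-experiment.yaml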

References

  • [1] Introduction to Chaos Mesh: https://chaos-mesh.org/docs/
  • [2] Introduction to KubeBlocks: https://kubeblocks.io/docs/preview/user_docs/overview/introduction
  • [3] KubeBlocks v1.0.0 High Availability Test Report: https://kubeblocks.io/reports/kubeblocks/v1-0-0/TEST_REPORT_CHAOS
