Many enterprises leverage persistent storage for Kubernetes to enhance the availability and reliability of their Kubernetes clusters. However, some may encounter issues similar to the following:
This problem is caused by the default volume attach/detach behavior in Kubernetes. With an improved Pod High Availability (HA) feature, IOMesh prevents this issue and improves Kubernetes cluster availability. In this article, we dive into how the Pod HA feature is implemented in IOMesh.
Kubernetes has default mechanisms that can lead to high availability failures for Pods when using RWO-type PVCs. If a Kubernetes worker node fails, the associated VolumeAttachment of an RWO PVC may not be deleted as intended due to these mechanisms. This could cause the system to mistakenly believe that the original Pod still uses the PVC, preventing the PVC from being bound to the Pod on the new node and ultimately impacting Pod availability.
The whole process is as follows:
Pod HA failure can be attributed to the incorrect decision made by the Kubernetes attach/detach controller (hereinafter referred to as A/D controller). The A/D controller is a built-in controller in Kubernetes, responsible for deciding which node the volume should be attached to or detached from according to the Pod scheduling decisions issued by the kube-scheduler.
To make an effective intervention, we analyzed the A/D controller’s working principle. This can be simplified as follows:
Now, let’s retell the story of Pod HA failure in light of how the A/D controller works. Suppose Deployment-A manages Pod-A on Node-A and Pod-B on Node-B, and both Pods are bound to an RWO-type PVC. If Node-A loses network connectivity, Pod-A transitions to the Terminating state. Kubernetes does not force the deletion of Pod-A in this situation, so the binding between Pod-A and Node-A remains in the A/D controller's desiredStateOfWorld cache, and the deletion of the corresponding VolumeAttachment on Node-A is never triggered. Meanwhile, Deployment-A detects that the number of running replicas is lower than expected and decides to rebuild Pod-A on Node-C. Because of the default Kubernetes handling of RWO-type PVCs, the A/D controller treats the volume as still attached to Node-A and therefore rejects the request to bind it for Pod-A on the new node. As a result, the rebuilt Pod-A remains stuck in the ContainerCreating state, which eventually leads to high availability failure.
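The scenario above can be replayed with a small toy model. This is only a sketch of the bookkeeping described in this article, not the real Kubernetes code: the class `ToyADController`, its methods, and the event strings are hypothetical, and the model ignores timing and the second Pod (Pod-B) for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyADController:
    """Toy model of attach/detach bookkeeping (hypothetical, not Kubernetes code)."""
    # Desired state: volume -> set of (pod, node) pairs whose Pod objects still exist.
    desired: dict = field(default_factory=dict)
    # Actual state: volume -> node that currently holds the attachment.
    actual: dict = field(default_factory=dict)

    def add_pod(self, pod, node, volume):
        self.desired.setdefault(volume, set()).add((pod, node))

    def remove_pod(self, pod, node, volume):
        self.desired.get(volume, set()).discard((pod, node))

    def reconcile(self):
        """One pass: detach attachments nobody wants, then attach wanted ones."""
        events = []
        # Detach only when no remaining Pod object wants the volume on that node.
        for volume, node in list(self.actual.items()):
            wanted_nodes = {n for _, n in self.desired.get(volume, set())}
            if node not in wanted_nodes:
                del self.actual[volume]
                events.append(f"detached {volume} from {node}")
        # An RWO volume is attached to at most one node at a time.
        for volume, pairs in self.desired.items():
            for _pod, node in sorted(pairs):
                attached_to = self.actual.get(volume)
                if attached_to is None:
                    self.actual[volume] = node
                    events.append(f"attached {volume} to {node}")
                elif attached_to != node:
                    events.append(
                        f"rejected {volume} on {node}: still attached to {attached_to}"
                    )
        return events


ctrl = ToyADController()
ctrl.add_pod("pod-a", "node-a", "pvc-1")
ctrl.reconcile()  # pvc-1 is attached to node-a

# Node-A fails: pod-a is Terminating, but its Pod object is never removed,
# while the Deployment schedules a replacement on node-c.
ctrl.add_pod("pod-a-new", "node-c", "pvc-1")
stuck = ctrl.reconcile()

# Once the old Pod object is actually deleted, the controller detaches
# the volume itself and the replacement can proceed.
ctrl.remove_pod("pod-a", "node-a", "pvc-1")
recovered = ctrl.reconcile()

print(stuck)
print(recovered)
```

In this model the replacement stays rejected for as long as the old Pod object exists, which mirrors the stuck ContainerCreating state described above.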
Based on the analysis above, a new binding can be established in one of two ways: either the existing VolumeAttachment on Node-A is deleted directly, or Pod-A is removed once it enters the Terminating state so that the A/D controller detaches the volume itself.
Initially, we attempted the first option: manually deleting the VolumeAttachment on Node-A to enable the reconstruction of Pod-A on Node-C. This approach yields immediate results and can be done automatically by adding a VolumeAttachment controller within the CSI driver.
However, after analyzing the code of the A/D controller, we found that user-initiated deletion of a VolumeAttachment is unexpected behavior and carries some risk. When the A/D controller itself deletes a VolumeAttachment, it updates the cached information in actualStateOfWorld. If a user deletes a VolumeAttachment manually, however, the A/D controller is not aware of it, so the cached information in actualStateOfWorld becomes inaccurate and may lead the A/D controller to make incorrect decisions.
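The risk can be illustrated with a deliberately simplified model of the cache. This sketch only encodes the behavior described in the paragraph above; `ToyAttachmentCache` and its method names are hypothetical and do not correspond to Kubernetes types.

```python
class ToyAttachmentCache:
    """Hypothetical model: real attachments versus the controller's cached view."""

    def __init__(self):
        self.cluster_attachments = set()  # objects that really exist in the cluster
        self.actual_state_cache = set()   # the controller's in-memory view

    def controller_attach(self, volume, node):
        self.cluster_attachments.add((volume, node))
        self.actual_state_cache.add((volume, node))

    def controller_detach(self, volume, node):
        # The controller performs the detach, so it also updates its cache.
        self.cluster_attachments.discard((volume, node))
        self.actual_state_cache.discard((volume, node))

    def manual_delete(self, volume, node):
        # Out-of-band deletion: the object disappears, the cache is untouched.
        self.cluster_attachments.discard((volume, node))

    def is_consistent(self):
        return self.cluster_attachments == self.actual_state_cache


manual = ToyAttachmentCache()
manual.controller_attach("pvc-1", "node-a")
manual.manual_delete("pvc-1", "node-a")

managed = ToyAttachmentCache()
managed.controller_attach("pvc-1", "node-a")
managed.controller_detach("pvc-1", "node-a")

print(manual.is_consistent(), managed.is_consistent())
```

After the manual deletion the cached view still lists an attachment that no longer exists, whereas the controller-driven detach keeps both views aligned.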
Therefore, we believe that it is better to delegate the deletion of the Pod to CSI, allowing the A/D controller to autonomously complete the detach and cache update process. This is how we implement the Pod HA feature in IOMesh.
To prevent the aforementioned issues and enhance cluster high availability, we introduce node awareness to IOMesh, allowing for automatic deletion and rebuilding of Pods during node failures. (Automatic deletion is enabled by default; users can disable it manually.)
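As a rough illustration of what node awareness could look like, the sketch below selects Pods for deletion once their node has been unavailable for longer than a monitoring window. The actual IOMesh logic is not shown in this article, so everything here is an assumption: the types `NodeStatus` and `PodInfo`, the function `pods_to_delete`, the restriction to Pods that use RWO PVCs, and the 90-second default (taken from the default monitoring time quoted in this article).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeStatus:
    name: str
    not_ready_since: Optional[float]  # timestamp in seconds, None while Ready


@dataclass
class PodInfo:
    name: str
    node: str
    uses_rwo_pvc: bool


def pods_to_delete(
    nodes: List[NodeStatus],
    pods: List[PodInfo],
    now: float,
    monitor_seconds: float = 90.0,  # assumed default monitoring window
) -> List[str]:
    """Return names of Pods on nodes that have been NotReady for too long."""
    failed_nodes = {
        n.name
        for n in nodes
        if n.not_ready_since is not None
        and now - n.not_ready_since >= monitor_seconds
    }
    # Assumption: only Pods that hold an RWO PVC need to be deleted.
    return [p.name for p in pods if p.node in failed_nodes and p.uses_rwo_pvc]


nodes = [
    NodeStatus("node-a", not_ready_since=1000.0),  # failed 100 s ago
    NodeStatus("node-b", not_ready_since=None),    # healthy
    NodeStatus("node-c", not_ready_since=1080.0),  # failed only 20 s ago
]
pods = [
    PodInfo("pod-a", "node-a", uses_rwo_pvc=True),
    PodInfo("pod-b", "node-b", uses_rwo_pvc=True),
    PodInfo("pod-c", "node-c", uses_rwo_pvc=True),
    PodInfo("pod-d", "node-a", uses_rwo_pvc=False),
]
selected = pods_to_delete(nodes, pods, now=1100.0)
print(selected)
```

Deleting the selected Pods, rather than their VolumeAttachments, leaves the detach and the cache update to the A/D controller.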
The implementation process of this mechanism is as follows:
With default settings, the entire Pod HA process takes 7 minutes and 30 seconds. Of this, 1 minute and 30 seconds is the default monitoring time between the node failure and the deletion of the Pod by CSI; users can adjust this interval when configuring the IOMesh CSI driver. The remaining 6 minutes is the fixed time required to detach the PVC from the failed node, a predetermined parameter that Kubernetes sets for the A/D controller.
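The arithmetic behind this figure can be checked with a short calculation. The two durations come from the paragraph above; the function `recovery_timeline` and the constant names are hypothetical and exist only for this illustration.

```python
from datetime import datetime, timedelta

MONITOR_WINDOW = timedelta(seconds=90)  # default monitoring time, configurable
DETACH_WAIT = timedelta(minutes=6)      # fixed detach wait quoted in this article


def recovery_timeline(node_failed_at, monitor_window=MONITOR_WINDOW):
    """Estimate when the Pod is deleted and when the volume is detached."""
    pod_deleted_at = node_failed_at + monitor_window
    volume_detached_at = pod_deleted_at + DETACH_WAIT
    return {
        "pod_deleted_at": pod_deleted_at,
        "volume_detached_at": volume_detached_at,
    }


failure = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary example timestamp
timeline = recovery_timeline(failure)
total = timeline["volume_detached_at"] - failure
print(total)  # 0:07:30

# Shortening the monitoring window only affects the first phase.
faster = recovery_timeline(failure, monitor_window=timedelta(seconds=30))
print(faster["volume_detached_at"] - failure)  # 0:06:30
```

Because the 6-minute detach wait is fixed, tuning the monitoring window is the only way to shorten the total in this model.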