You can instruct the Kubernetes scheduler on how to distribute pods across the compute nodes in the cluster. Affinity rules express a preference for the scheduler to run pods on a given cluster node, or to avoid running pods on a node where another set of pods is running.
Cluster administrators can set topology labels in the nodes to indicate the failure domain that they belong to. Application developers can configure the deployment resources to use the pod anti-affinity settings with custom topology labels to distribute the workloads across the failure domains.
Sometimes, applications require a minimum number of replicas to remain available when a cluster disruption occurs. This requirement often comes from load testing, which might indicate that a given number of replicas is needed to provide a suitable response time for users. Pod disruption budget resources set minimum availability constraints that require a given number or percentage of pods to remain running even when a node is drained.
Pod affinity is a group of rules that are set in the pod specification. The scheduler uses the rules to place pods in the same cluster node where another workload is present, or to avoid placing pods in the same node where the pods from other workloads are running.
The pod affinity terms define a set of selectors to match the labels of other pods that are running in the cluster nodes. The scheduler can evaluate the pod affinity terms on a best-effort basis if the affinity setting is preferred, or require the conditions to be present to schedule the pod.
Kubernetes implements two types of pod affinity rules: pod affinity and pod anti-affinity.
The pod affinity setting enables scheduling of pods in the same node where the pods from another workload are running. The pod specification contains the pod affinity setting, which uses a selector to match the labels from other pods by using an operator.
Pod affinity helps reduce the network latency between specific pods, and can improve the performance of workloads that require significant communication between their pods.
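For example, a required pod affinity rule can co-locate the pods of one workload with the pods of another workload on the same node. The following is a minimal sketch under assumed names: the frontend workload, its app=frontend label, the cache pod, and the container image are illustrative and not taken from this material.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache                 # illustrative pod name
  labels:
    app: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app          # select pods that carry the assumed app=frontend label
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname   # co-locate on the same node
  containers:
  - name: cache
    image: registry.example.com/cache:latest  # placeholder image
```

With this rule, the scheduler places the cache pod only on a node that already runs a pod with the app=frontend label, which keeps the communication between the two workloads on the same node.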
In the following example, a deployment is configured to run two replicas, and the pods have a pod affinity setting that directs the scheduler to place them on the same node.

The scheduler places the first pod on the first node. No preference exists yet, because no running pods match the affinity selector. When scheduling the second pod, the scheduler looks for pods that match the selector, and places the pod on the same node as the first pod.
Similar to node affinity, you can apply required or preferred affinity terms to the pod affinity or pod anti-affinity settings.

requiredDuringSchedulingIgnoredDuringExecution
The scheduler evaluates the affinity terms only when the pod is created. The pod affinity terms are enforced at creation time, and the scheduler does not move the pod to another node if the conditions change later. The required pod affinity rules include a list of selectors to match the labels of pods that are running on a cluster node.

preferredDuringSchedulingIgnoredDuringExecution
The scheduler evaluates the pod affinity terms on a best-effort basis, and might choose a node that does not satisfy all the terms if no node matches all the conditions. The scheduler does not move the pod if the conditions change. The preferred scheduling term has a list of weighted pod selector terms. Each term can match labels from running pods, and each rule has a weight value in the range 1-100 inclusive.

The following pod specification uses a preferred pod anti-affinity term with a custom topology key:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: custom
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - custom
          topologyKey: rack
...content omitted...
```

- `preferredDuringSchedulingIgnoredDuringExecution`: The expressions are evaluated on a best-effort basis, and the pod is scheduled on a node even if the constraints are not met.
- `weight`: The weight value is in the range 1-100 inclusive.
- `matchExpressions`: The expression that matches the labels of running pods.
- `operator`: The operator that matches the labels.
- `topologyKey`: The topology key that the scheduler uses to distribute pods across the failure domains.
The Kubernetes scheduler can use node labels to indicate the failure domain that each node belongs to. The failure domain can be a rack that is connected to a different power source or networking equipment, or an indicator that the node is in another building. Kubernetes uses the following node labels as default topology keys to distribute the workloads across the cluster nodes:
kubernetes.io/hostname
This label is set to match the hostname value of each node in the cluster.
topology.kubernetes.io/zone
Represents the availability zone of the node. This label is set by the cloud provider and might not be present in on-premise deployments.
topology.kubernetes.io/region
Represents the region of the node. This label is set by the cloud provider and might not be present in on-premise deployments.
The topology key helps the scheduler to spread the pods across the failure domains. The cluster administrator must label all the nodes to indicate their corresponding failure domain. Application developers can use the node label as a custom topology key in the pod affinity terms, to direct the scheduler how to place the pods in the cluster nodes.
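For example, a deployment can combine a preferred pod anti-affinity term with the rack custom topology key so that the scheduler tries to spread the replicas across racks. The following is a minimal sketch; the deployment name, the app=webapp labels, and the image are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                          # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: webapp           # avoid racks that already run a webapp pod
              topologyKey: rack         # custom topology key set by the administrator
      containers:
      - name: webapp
        image: registry.example.com/webapp:latest   # placeholder image
```

Because the term is preferred rather than required, the scheduler still places the pods if only one rack has capacity.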
The following scenario describes a cluster distributed between two racks. Each rack is connected to a different power source. The nodes are distributed by rack according to the following table:
| Topology Key | Control plane nodes | Compute nodes |
|---|---|---|
| rack=rack-1 | master02, master03 | node01 |
| rack=rack-2 | master01 | node02, node03 |
The administrator added the rack label to indicate the location and failure domain of each node.
This label is intended as a custom topology key, so that the scheduler can spread the pods evenly across the compute nodes in different racks.
With the default topology keys, the value of the kubernetes.io/hostname topology key is different on each node, so the scheduler identifies each node as a separate failure domain and distributes the pods evenly across all the compute nodes.

With the rack custom topology key, the key specifies the failure domain that each node belongs to, and the scheduler spreads the pods evenly across the failure domains. Three pods are placed on node01 in the rack-1 failure domain, and the other three pods are placed on the nodes in the rack-2 failure domain.
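The rack key is an ordinary node label. As a sketch, the administrator might add it with a command such as oc label node node01 rack=rack-1, after which an excerpt of the node resource would show the custom label next to the default topology keys (the label values follow the preceding table):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node01
  labels:
    kubernetes.io/hostname: node01      # default topology key
    rack: rack-1                        # custom topology label added by the administrator
...content omitted...
```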
The pod disruption budget is a policy object where you can specify the percentage or number of pods that must remain running when a voluntary disruption occurs in the cluster.
The pod disruption budget resource is part of the policy/v1 API group.
Kubernetes clusters have two types of disruptions: involuntary and voluntary.
Involuntary disruptions occur when a node fails and is disconnected from the cluster. The involuntary disruptions could be related to a hardware problem, a network issue within a zone, or a power issue in a rack. The Kubernetes cluster detects that a node is not responsive, creates replacement pods in another node, and eventually deletes all the pods that are scheduled on the failed node.
Voluntary disruptions occur when nodes are cordoned or taken offline for maintenance. When a voluntary disruption occurs, all the pods on the affected node are evicted at the same time, and the scheduler creates replacement pods in another node. If all the pods of a workload are running in the same node and that node is drained, then the minimum availability constraints for the application are not fulfilled.
You can define the pod disruption budget resource with either the minAvailable or the maxUnavailable attribute.
The values for these settings are often discovered after load testing on the application.
For example, a given number of pods might need to run at all times to prevent the application from queuing user requests. In another case, the application might deliver an acceptable response time for users even when only 60% of the pods are running.
minAvailable
The minimum number of pods to be available, even during a voluntary disruption. You can set this attribute to a percentage or integer value to specify the minimum number of replica pods to be available.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-minavailable
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: nginx
```
maxUnavailable
The maximum number of pods that can be unavailable during a voluntary disruption. You can set this attribute to a percentage or integer value to specify the maximum number of replica pods that can be unavailable.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-maxunavailable
spec:
  maxUnavailable: 33%
  selector:
    matchLabels:
      app: nginx
```
You can list or describe the pod disruption budget resources to inspect their properties.
```
[user@host ~]$ oc get pdb nginx-minavailable
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-minavailable   80%             N/A               2                     45s
```
You can also describe the resource to view the label selector for the affected pods.
```
[user@host ~]$ oc describe pdb nginx-maxunavailable
Name:             nginx-maxunavailable
Namespace:        scheduling-pdb
Max unavailable:  33%
Selector:         app=nginx
Status:
    Allowed disruptions:  4
    Current:              10
    Desired:              6
    Total:                10
Events:               <none>
```
- `Max unavailable`: The maximum percentage or number of pods that can be unavailable.
- `Selector`: The label selector for the affected pods.
The deployment is configured to run six replica pods that are distributed between the compute nodes. The application has an appropriate response time if five or more pods are running.
The first node is drained by using the oc adm drain command.
The node is cordoned and marked as unschedulable.
```
[user@host ~]$ oc adm drain node/node01 ...
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
```

You can safely ignore the warnings about managed pods and client-side throttling.
All the pods that are running in the drained node are marked for eviction. The scheduler creates replacement pods in another node. Only four pods are ready and the application starts responding slowly.
```
...output omitted...
evicting pod scheduling-pdb/nginx-56bdf7d8c6-kf98k
evicting pod scheduling-pdb/nginx-56bdf7d8c6-8kk2b
pod/nginx-56bdf7d8c6-kf98k evicted
pod/nginx-56bdf7d8c6-8kk2b evicted
...output omitted...
```
The compute node is finally marked as drained. The replacement pods are marked as ready. The application now runs all six pods and the response time is within user expectations.
```
...output omitted...
node/node01 drained
```

The full output of the oc adm drain command is as follows:

```
[user@host ~]$ oc adm drain node/node01 --ignore-daemonsets --delete-emptydir-data
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
evicting pod scheduling-pdb/nginx-56bdf7d8c6-kf98k
evicting pod scheduling-pdb/nginx-56bdf7d8c6-8kk2b
...output omitted...
pod/nginx-56bdf7d8c6-kf98k evicted
pod/nginx-56bdf7d8c6-8kk2b evicted
...output omitted...
node/node01 drained
```
With a pod disruption budget resource, you can specify the minimum availability constraints for the application. The node drain is blocked until the replacement pods for the application are scheduled in another node and become ready. The pod eviction is blocked until the pod disruption budget availability constraints are met. The pod eviction operation retries after five seconds.
The deployment is configured to run six replica pods that are distributed between the compute nodes.
The application has an appropriate response time if five or more pods are running.
The developer creates a pod disruption budget resource where the spec.minAvailable parameter is set to five, and specifies the selector to match the labels of the deployment pods.
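A minimal sketch of such a pod disruption budget, assuming the scheduling-pdb namespace and the app=nginx pod labels that appear in the earlier examples, might look like the following; the resource name is illustrative.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-minavailable-five         # illustrative name
  namespace: scheduling-pdb
spec:
  minAvailable: 5                       # at least five of the six replicas must remain available
  selector:
    matchLabels:
      app: nginx
```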
The first node is drained by using the oc adm drain command.
The node is cordoned and marked as unschedulable.
```
[user@host ~]$ oc adm drain node/node01 ...
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
```
You can safely ignore the warnings about managed pods and client-side throttling.
All the pods that are running in the drained node are marked for eviction. The scheduler creates a replacement pod in another node and waits for the new pod to be ready. The second pod cannot be evicted, because the availability constraint states that five out of six pods must be running. The eviction operation for the second pod retries after five seconds.
```
...output omitted...
evicting pod scheduling-pdb/nginx-647b4c98f5-nqvs6
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
error when evicting pods/"nginx-647b4c98f5-98ptm" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...output omitted...
```

The drain process is blocked until all the pods are evicted from the node.
The first replacement pod is marked as ready, and the first pod is evicted from the drained node. The scheduler creates a second replacement pod in another node and waits for it to be ready.
```
...output omitted...
pod/nginx-647b4c98f5-nqvs6 evicted
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
...output omitted...
```
The second replacement pod is marked as ready, and the second pod is evicted from the drained node.
```
...output omitted...
pod/nginx-647b4c98f5-98ptm evicted
...output omitted...
```
The compute node is finally marked as drained.
```
...output omitted...
node/node01 drained
```

The full output of the oc adm drain command is as follows:

```
[user@host ~]$ oc adm drain node/node01 --ignore-daemonsets --delete-emptydir-data
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
evicting pod scheduling-pdb/nginx-647b4c98f5-nqvs6
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
error when evicting pods/"nginx-647b4c98f5-98ptm" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...output omitted...
error when evicting pods/"nginx-647b4c98f5-98ptm" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/nginx-647b4c98f5-nqvs6 evicted
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
...output omitted...
pod/nginx-647b4c98f5-98ptm evicted
...output omitted...
node/node01 drained
```
- The pod is marked for eviction.
- The pod eviction is blocked until the availability constraints from the pod disruption budget are met.
- The pod is finally evicted from the drained node.
You can safely ignore the warnings about managed pods and client-side throttling.
For more information about pod scheduling on OpenShift, refer to the Controlling Pod Placement onto Nodes (Scheduling) section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#controlling-pod-placement-onto-nodes-scheduling
For more information about pod disruption budgets, refer to the Understanding How to Use Pod Disruption Budgets to Specify the Number of Pods That Must Be Up section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#nodes-pods-pod-distruption-about_nodes-pods-configuring
For more information about pod disruption budgets, refer to the Pod Disruption Budgets section in the Red Hat OpenShift Container Platform 4.14 Post-installation Configuration documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/postinstallation_configuration/index#post-install-pod-disruption-budgets
Kubernetes - Assigning Pods to Nodes
Kubernetes - Specifying a Disruption Budget for Your Application