You can instruct the Kubernetes scheduler on how to distribute pods across the compute nodes in the cluster. Affinity rules express a preference for the scheduler to run pods on a given cluster node, or to avoid running pods on a node where another set of pods is running.
Cluster administrators can set topology labels in the nodes to indicate the failure domain that they belong to. Application developers can configure the deployment resources to use the pod anti-affinity settings with custom topology labels to distribute the workloads across the failure domains.
Sometimes, applications require a minimum number of replicas to remain available when a cluster disruption occurs. This requirement often comes from load testing, which might indicate that a given number of replicas is needed to provide a suitable response time for users. Pod disruption budget resources set minimum availability constraints that require a given number or percentage of pods to remain running even when a node is drained.
Pod affinity is a group of rules that are set in the pod specification. The scheduler uses the rules to place pods in the same cluster node where another workload is present, or to avoid placing pods in the same node where the pods from other workloads are running.
The pod affinity terms define a set of selectors to match the labels of other pods that are running in the cluster nodes. The scheduler can evaluate the pod affinity terms on a best-effort basis if the affinity setting is preferred, or require the conditions to be present to schedule the pod.
Kubernetes implements two types of pod affinity rules: pod affinity and pod anti-affinity.
The pod affinity setting enables scheduling of pods in the same node where the pods from another workload are running. The pod specification contains the pod affinity setting, which uses a selector to match the labels from other pods by using an operator.
Pod affinity helps reduce the network latency between specific pods, and can improve the performance of workloads that require significant communication between their pods.
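For example, a required pod affinity rule can co-locate the pods of one workload with the pods of another workload on the same node. The following is a minimal sketch under assumed names: the frontend workload, its app=frontend label, the cache pod, and the container image are illustrative and not taken from this material.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache                 # illustrative pod name
  labels:
    app: cache
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app          # select pods that carry the assumed app=frontend label
            operator: In
            values:
            - frontend
        topologyKey: kubernetes.io/hostname   # co-locate on the same node
  containers:
  - name: cache
    image: registry.example.com/cache:latest  # placeholder image
```

With this rule, the scheduler places the cache pod only on a node that already runs a pod with the app=frontend label, which keeps the communication between the two workloads on the same node.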
In the following example, a deployment is configured to run two replicas, and the pods have a pod affinity setting that directs the scheduler to place them on the same node.

The scheduler places the first pod on the first node. No preference exists yet, because no running pods match the affinity selector. When scheduling the second pod, the scheduler looks for pods that match the selector, and places the pod on the same node as the first pod.
Similar to node affinity, you can apply required or preferred affinity terms to the pod affinity or pod anti-affinity settings.

requiredDuringSchedulingIgnoredDuringExecution
The scheduler evaluates the affinity terms only when the pod is created. The pod affinity terms are enforced at creation time, and the scheduler does not move the pod to another node if the conditions change later. The required pod affinity rules include a list of selectors to match the labels of pods that are running on a cluster node.

preferredDuringSchedulingIgnoredDuringExecution
The scheduler evaluates the pod affinity terms on a best-effort basis, and might choose a node that does not satisfy all the terms if no node matches all the conditions. The scheduler does not move the pod if the conditions change. The preferred scheduling term has a list of weighted pod selector terms. Each term can match labels from running pods, and each rule has a weight value in the range 1-100 inclusive.

The following pod specification uses a preferred pod anti-affinity term with a custom topology key:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: custom
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - custom
          topologyKey: rack
...content omitted...
```

- `preferredDuringSchedulingIgnoredDuringExecution`: The expressions are evaluated on a best-effort basis, and the pod is scheduled on a node even if the constraints are not met.
- `weight`: The weight value is in the range 1-100 inclusive.
- `matchExpressions`: The expression that matches the labels of running pods.
- `operator`: The operator that matches the labels.
- `topologyKey`: The topology key that the scheduler uses to distribute pods across the failure domains.
The Kubernetes scheduler can use node labels to indicate the failure domain that each node belongs to. The failure domain can be a rack that is connected to a different power source or networking equipment, or an indicator that the node is in another building. Kubernetes uses the following node labels as default topology keys to distribute the workloads across the cluster nodes:
kubernetes.io/hostname
This label is set to match the hostname value of each node in the cluster.
topology.kubernetes.io/zone
Represents the availability zone of the node. This label is set by the cloud provider and might not be present in on-premise deployments.
topology.kubernetes.io/region
Represents the region of the node. This label is set by the cloud provider and might not be present in on-premise deployments.
The topology key helps the scheduler to spread the pods across the failure domains. The cluster administrator must label all the nodes to indicate their corresponding failure domain. Application developers can use the node label as a custom topology key in the pod affinity terms, to direct the scheduler how to place the pods in the cluster nodes.
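For example, a deployment can combine a preferred pod anti-affinity term with the rack custom topology key so that the scheduler tries to spread the replicas across racks. The following is a minimal sketch; the deployment name, the app=webapp labels, and the image are assumptions for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp                          # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: webapp           # avoid racks that already run a webapp pod
              topologyKey: rack         # custom topology key set by the administrator
      containers:
      - name: webapp
        image: registry.example.com/webapp:latest   # placeholder image
```

Because the term is preferred rather than required, the scheduler still places the pods if only one rack has capacity.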
The following scenario describes a cluster distributed between two racks. Each rack is connected to a different power source. The nodes are distributed by rack according to the following table:
| Topology Key | Control plane nodes | Compute nodes |
|---|---|---|
| rack=rack-1 | master02, master03 | node01 |
| rack=rack-2 | master01 | node02, node03 |
The administrator added the rack label to indicate the location and failure domain of each node.
This label is intended as a custom topology key, so that the scheduler can spread the pods evenly across the compute nodes in different racks.
With the default topology keys, the value of the kubernetes.io/hostname topology key is different on each node, so the scheduler identifies each node as a separate failure domain and distributes the pods evenly across all the compute nodes.

With the rack custom topology key, the key specifies the failure domain that each node belongs to, and the scheduler spreads the pods evenly across the failure domains. Three pods are placed on node01 in the rack-1 failure domain, and the other three pods are placed on the nodes in the rack-2 failure domain.
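The rack key is an ordinary node label. As a sketch, the administrator might add it with a command such as oc label node node01 rack=rack-1, after which an excerpt of the node resource would show the custom label next to the default topology keys (the label values follow the preceding table):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node01
  labels:
    kubernetes.io/hostname: node01      # default topology key
    rack: rack-1                        # custom topology label added by the administrator
...content omitted...
```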
The pod disruption budget is a policy object where you can specify the percentage or number of pods that must remain running when a voluntary disruption occurs in the cluster.
The pod disruption budget resource is part of the policy/v1 API group.
Kubernetes clusters have two types of disruptions: involuntary and voluntary.
Involuntary disruptions occur when a node fails and is disconnected from the cluster. The involuntary disruptions could be related to a hardware problem, a network issue within a zone, or a power issue in a rack. The Kubernetes cluster detects that a node is not responsive, creates replacement pods in another node, and eventually deletes all the pods that are scheduled on the failed node.
Voluntary disruptions occur when nodes are cordoned or taken offline for maintenance. When a voluntary disruption occurs, all the pods on the affected node are evicted at the same time, and the scheduler creates replacement pods in another node. If all the pods of a workload are running in the same node and that node is drained, then the minimum availability constraints for the application are not fulfilled.
You can define the pod disruption budget resource with either the minAvailable or the maxUnavailable attribute.
The values for these settings are often discovered after load testing on the application.
For example, a given number of pods might need to run at all times to prevent the application from queuing user requests. In another case, the application might deliver an acceptable response time for users even when only 60% of the pods are running.
minAvailable
The minimum number of pods to be available, even during a voluntary disruption. You can set this attribute to a percentage or integer value to specify the minimum number of replica pods to be available.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-minavailable
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: nginx
```
maxUnavailable
The maximum number of pods that can be unavailable during a voluntary disruption. You can set this attribute to a percentage or integer value to specify the maximum number of replica pods that can be unavailable.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-maxunavailable
spec:
  maxUnavailable: 33%
  selector:
    matchLabels:
      app: nginx
```
You can list or describe the pod disruption budget resources to inspect their properties.
```
[user@host ~]$ oc get pdb nginx-minavailable
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-minavailable   80%             N/A               2                     45s
```
You can also describe the resource to view the label selector for the affected pods.
```
[user@host ~]$ oc describe pdb nginx-maxunavailable
Name:             nginx-maxunavailable
Namespace:        scheduling-pdb
Max unavailable:  33%
Selector:         app=nginx
Status:
    Allowed disruptions:  4
    Current:              10
    Desired:              6
    Total:                10
Events:               <none>
```
- `Max unavailable`: The maximum percentage or number of pods that can be unavailable.
- `Selector`: The label selector for the affected pods.
The deployment is configured to run six replica pods that are distributed between the compute nodes. The application has an appropriate response time if five or more pods are running.
The first node is drained by using the oc adm drain command.
The node is cordoned and marked as unschedulable.
```
[user@host ~]$ oc adm drain node/node01 ...
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
```

You can safely ignore the warnings about managed pods and client-side throttling.
All the pods that are running in the drained node are marked for eviction. The scheduler creates replacement pods in another node. Only four pods are ready and the application starts responding slowly.
```
...output omitted...
evicting pod scheduling-pdb/nginx-56bdf7d8c6-kf98k
evicting pod scheduling-pdb/nginx-56bdf7d8c6-8kk2b
pod/nginx-56bdf7d8c6-kf98k evicted
pod/nginx-56bdf7d8c6-8kk2b evicted
...output omitted...
```
The compute node is finally marked as drained. The replacement pods are marked as ready. The application now runs all six pods and the response time is within user expectations.
```
...output omitted...
node/node01 drained
```

The full output of the oc adm drain command is as follows:

```
[user@host ~]$ oc adm drain node/node01 --ignore-daemonsets --delete-emptydir-data
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
evicting pod scheduling-pdb/nginx-56bdf7d8c6-kf98k
evicting pod scheduling-pdb/nginx-56bdf7d8c6-8kk2b
...output omitted...
pod/nginx-56bdf7d8c6-kf98k evicted
pod/nginx-56bdf7d8c6-8kk2b evicted
...output omitted...
node/node01 drained
```
With a pod disruption budget resource, you can specify the minimum availability constraints for the application. The node drain is blocked until the replacement pods for the application are scheduled in another node and become ready. The pod eviction is blocked until the pod disruption budget availability constraints are met. The pod eviction operation retries after five seconds.
The deployment is configured to run six replica pods that are distributed between the compute nodes.
The application has an appropriate response time if five or more pods are running.
The developer creates a pod disruption budget resource where the spec.minAvailable parameter is set to five, and specifies the selector to match the labels of the deployment pods.
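A minimal sketch of such a pod disruption budget, assuming the scheduling-pdb namespace and the app=nginx pod labels that appear in the earlier examples, might look like the following; the resource name is illustrative.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-minavailable-five         # illustrative name
  namespace: scheduling-pdb
spec:
  minAvailable: 5                       # at least five of the six replicas must remain available
  selector:
    matchLabels:
      app: nginx
```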
The first node is drained by using the oc adm drain command.
The node is cordoned and marked as unschedulable.
```
[user@host ~]$ oc adm drain node/node01 ...
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
```
You can safely ignore the warnings about managed pods and client-side throttling.
All the pods that are running in the drained node are marked for eviction. The scheduler creates a replacement pod in another node and waits for the new pod to be ready. The second pod cannot be evicted, because the availability constraint states that five out of six pods must be running. The eviction operation for the second pod retries after five seconds.
```
...output omitted...
evicting pod scheduling-pdb/nginx-647b4c98f5-nqvs6
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
error when evicting pods/"nginx-647b4c98f5-98ptm" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...output omitted...
```

The drain process is blocked until all the pods are evicted from the node.
The first replacement pod is marked as ready, and the first pod is evicted from the drained node. The scheduler creates a second replacement pod in another node and waits for it to be ready.
```
...output omitted...
pod/nginx-647b4c98f5-nqvs6 evicted
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
...output omitted...
```
The second replacement pod is marked as ready, and the second pod is evicted from the drained node.
```
...output omitted...
pod/nginx-647b4c98f5-98ptm evicted
...output omitted...
```
The compute node is finally marked as drained.
```
...output omitted...
node/node01 drained
```

The full output of the oc adm drain command is as follows:

```
[user@host ~]$ oc adm drain node/node01 --ignore-daemonsets --delete-emptydir-data
node/node01 cordoned
Warning: ignoring DaemonSet-managed Pods: ...output omitted...
… request.go:682] Waited for … due to client-side throttling ...output omitted...
...output omitted...
evicting pod scheduling-pdb/nginx-647b4c98f5-nqvs6
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
error when evicting pods/"nginx-647b4c98f5-98ptm" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...output omitted...
error when evicting pods/"nginx-647b4c98f5-98ptm" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/nginx-647b4c98f5-nqvs6 evicted
evicting pod scheduling-pdb/nginx-647b4c98f5-98ptm
...output omitted...
pod/nginx-647b4c98f5-98ptm evicted
...output omitted...
node/node01 drained
```
- The pod is marked for eviction.
- The pod eviction is blocked until the availability constraints from the pod disruption budget are met.
- The pod is finally evicted from the drained node.
You can safely ignore the warnings about managed pods and client-side throttling.
For more information about pod scheduling on OpenShift, refer to the Controlling Pod Placement onto Nodes (Scheduling) section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#controlling-pod-placement-onto-nodes-scheduling
For more information about pod disruption budgets, refer to the Understanding How to Use Pod Disruption Budgets to Specify the Number of Pods That Must Be Up section in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#nodes-pods-pod-distruption-about_nodes-pods-configuring
For more information about pod disruption budgets, refer to the Pod Disruption Budgets section in the Red Hat OpenShift Container Platform 4.14 Post-installation Configuration documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/postinstallation_configuration/index#post-install-pod-disruption-budgets
Kubernetes - Assigning Pods to Nodes
Kubernetes - Specifying a Disruption Budget for Your Application