Guided Exercise: High Availability with Affinity Rules and Pod Disruption Budgets

Configure a workload to spread its pods between nodes from different failure domains and set minimum availability requirements. Then, drain a node to simulate a cluster update, and prove that the application keeps minimum capacity and availability.

Outcomes

  • Add pod anti-affinity settings to a deployment resource manifest to spread the pods across the failure domains.

  • Create a pod disruption budget to set a minimum availability constraint that is enforced during a voluntary cluster disruption.

  • Drain a compute node to simulate a voluntary disruption.

As the student user on the workstation machine, use the lab command to prepare your environment for this exercise.

[student@workstation ~]$ lab start scheduling-pdb

Instructions

The company has a local OpenShift cluster that is distributed between two racks. Each rack is connected to a different power source. The nodes are distributed by rack according to the following table:

Rack     Control plane nodes    Compute nodes
rack-a   master01, master02     worker03
rack-b   master03               worker01, worker02

The administrator needs to drain the compute nodes for maintenance. The administrator added the rack label to indicate the location and failure domain of each node. This label is intended as a custom topology key, so that the scheduler can spread the pods evenly across the compute nodes in different racks.

The application runs six pods and requires five of them to be available if a voluntary cluster disruption occurs, to achieve the intended response time.

The developer modifies the deployment resource to add the pod anti-affinity settings that use the custom topology key, and also creates a pod disruption budget to set the minimum availability constraint of the application.
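
The rack label that serves as the custom topology key is already applied by the lab environment. Conceptually, each node carries a label of the following form; this is a sketch of the relevant part of a Node resource, not a manifest that you apply:

apiVersion: v1
kind: Node
metadata:
  name: worker03          # a compute node from the previous table
  labels:
    rack: rack-a          # the custom topology key that identifies the failure domain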

  1. Verify that the nodes are labeled according to their rack location.

    1. Log in to the cluster as the admin user.

      [student@workstation ~]$ oc login -u admin -p redhatocp \
        https://api.ocp4.example.com:6443
      Login successful.
      
      ...output omitted...
    2. Verify that all the nodes have the rack labels according to the previous table.

      [student@workstation ~]$ oc get nodes -L rack
      NAME      STATUS  ROLES                 AGE  VERSION    RACK
      master01  Ready   control-plane,master  8d   v1.27.6+…  rack-a
      master02  Ready   control-plane,master  8d   v1.27.6+…  rack-a
      master03  Ready   control-plane,master  8d   v1.27.6+…  rack-b
      worker01  Ready   worker                7d   v1.27.6+…  rack-b  1
      worker02  Ready   worker                7d   v1.27.6+…  rack-b  2
      worker03  Ready   worker                7d   v1.27.6+…  rack-a  3

      1 2

      The worker01 and worker02 compute nodes are placed in the rack-b rack.

      3

      The worker03 compute node is placed in the rack-a rack.

  2. Create the deployment without pod anti-affinity settings or a pod disruption budget.

    1. Log in as the developer user and verify that you are using the scheduling-pdb project.

      [student@workstation ~]$ oc login -u developer -p developer
      Login successful.
      
      You have one project on this server: "scheduling-pdb"
      
      Using project "scheduling-pdb".
    2. Change to the ~/DO380/labs/scheduling-pdb directory.

      [student@workstation ~]$ cd ~/DO380/labs/scheduling-pdb
    3. Create the deployment by using the YAML resource manifest.

      [student@workstation scheduling-pdb]$ oc apply -f deployment.yaml
      deployment.apps/nginx created
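
      Note

      The deployment.yaml file is provided in the lab directory and is not reproduced in this exercise. For reference, a comparable manifest without affinity settings might look like the following sketch; the image name is an assumption, and the lab file is authoritative.

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nginx
        labels:
          app: nginx
      spec:
        replicas: 6                  # the application runs six replica pods
        selector:
          matchLabels:
            app: nginx               # the label that the anti-affinity rule and the PDB select later
        template:
          metadata:
            labels:
              app: nginx
          spec:
            containers:
            - name: nginx
              image: nginx           # hypothetical image; the lab manifest specifies the actual image
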
    4. Open a new terminal window, and then execute the following command to see the status of the pod disruption budget, deployment, and pods.

      Wait until all pods are running, and verify that all the pods from the nginx deployment are marked as ready and available.

      This process might take a few minutes.

      [student@workstation scheduling-pdb]$ watch oc get pdb,deployments,pods -o wide
      Every 2.0s: oc get pdb,deployments,pods ... workstation: Wed Jan  3 15:59:55 2024
      
      NAME                    READY   UP-TO-DATE   AVAILABLE   AGE   ...
      deployment.apps/nginx   6/6     6            6           60s   ...
      
      NAME                        READY  STATUS   RESTARTS  AGE  IP   NODE      ...
      pod/nginx-5676948d76-75l7z  1/1    Running  0         60s  ...  worker01  ...
      pod/nginx-5676948d76-pkdr7  1/1    Running  0         60s  ...  worker01  ...
      pod/nginx-5676948d76-gcst5  1/1    Running  0         60s  ...  worker01  ...
      pod/nginx-5676948d76-njzv2  1/1    Running  0         60s  ...  worker02  ...
      pod/nginx-5676948d76-94zmk  1/1    Running  0         60s  ...  worker03  ...
      pod/nginx-5676948d76-mcdbj  1/1    Running  0         60s  ...  worker03  ...

      Note

      Keep this terminal window open to view the status of the resources from this exercise.

    5. Return to the first terminal window and count the pods that are running on each compute node by using the count-pods.sh shell script.

      The replica pods are distributed across the cluster nodes but are not distributed evenly between the rack-a and rack-b failure domains.

      [student@workstation scheduling-pdb]$ ./count-pods.sh
      NODE            PODS
      worker01        3  1
      worker02        1  2
      worker03        2  3

      1 2

      The worker01 and worker02 compute nodes are placed in the rack-b rack.

      3

      The worker03 compute node is placed in the rack-a rack.

      Note

      Although the exact number of pods that are running on each node might be different, the total replica count is six pods.

  3. Simulate a voluntary disruption where the cluster administrator takes the worker01 node offline for maintenance.

    Important

    The selected node for draining must have at least two pods running.

    1. Log in as the admin user.

      [student@workstation scheduling-pdb]$ oc login -u admin -p redhatocp
      Login successful.
      
      ...output omitted...
    2. Drain the worker01 node to simulate taking it offline for maintenance.

      This command might take a few minutes to complete. Leave it running and continue with the next step. You review the output of this command in a later step.

      [student@workstation scheduling-pdb]$ oc adm drain node/worker01 \
        --ignore-daemonsets --delete-emptydir-data
      
      ...output omitted...
    3. Switch to the second terminal window to view the eviction of the pods from the drained node. Wait until all pods are running on another node and are marked as ready. This process might take a few minutes.

      All the application pods on the drained node are evicted at the same time, and the minimum availability constraint is not met. Use the values in the AGE column to determine which pods were evicted from the drained node and scheduled on a different node.

      This situation happens because no pod disruption budget is associated with the deployment pods, and the deployment resource also does not have a pod anti-affinity setting that uses the rack label as a custom topology key.

      Every 2.0s: oc get pdb,deployments,pods ... workstation: Wed Jan  3 16:02:33 2024
      
      NAME                   READY  UP-TO-DATE  AVAILABLE  AGE  ...
      deployment.apps/nginx  3/6    6           3          14m  ...  1
      
      NAME                        READY  STATUS    ...  AGE  IP   NODE      ...
      pod/nginx-5676948d76-njzv2  1/1    Running   ...  3m   ...  worker02  ...
      pod/nginx-5676948d76-94zmk  1/1    Running   ...  3m   ...  worker03  ...
      pod/nginx-5676948d76-mcdbj  1/1    Running   ...  3m   ...  worker03  ...
      pod/nginx-5676948d76-hfbv9  0/1    Init:0/1  ...  1s   ...  worker02  ...  2
      pod/nginx-5676948d76-zxxlg  0/1    Init:0/1  ...  1s   ...  worker02  ...
      pod/nginx-5676948d76-dh6dh  0/1    Init:0/1  ...  1s   ...  worker03  ...

      1

      Only three replica pods are available.

      2

      Three replacement pods are created at the same time for the pods that were evicted from the drained compute node.

      Note

      Although the exact number of pods that are running on each node might be different, the total replica count is six pods.

    4. Return to the first terminal window and inspect the output of the oc adm drain command.

      Observe the pod eviction messages of the nginx pods. All the application pods on the drained node are evicted at the same time, and the minimum availability constraint is not met.

      [student@workstation scheduling-pdb]$ oc adm drain node/worker01 \
        --ignore-daemonsets --delete-emptydir-data
      node/worker01 cordoned
      Warning: ignoring DaemonSet-managed Pods: ...output omitted...
      ...output omitted...
      I1221 21:29:52.102938  111157 request.go:696] ...output omitted...
      ...output omitted...
      evicting pod scheduling-pdb/nginx-5676948d76-pkdr7  1
      evicting pod scheduling-pdb/nginx-5676948d76-75l7z
      evicting pod scheduling-pdb/nginx-5676948d76-gcst5
      ...output omitted...
      pod/nginx-5676948d76-gcst5 evicted  2
      pod/nginx-5676948d76-75l7z evicted
      pod/nginx-5676948d76-pkdr7 evicted
      ...output omitted...
      node/worker01 drained

      1

      All the pods are marked for eviction when the node is drained.

      2

      All the pods are evicted from the node at the same time and the application availability constraint is not met.

      Note

      You can safely ignore the warnings about managed pods and client-side throttling.

    5. Get the state of the nodes to verify that the drained node is marked as not schedulable.

      [student@workstation scheduling-pdb]$ oc get nodes
      NAME       STATUS                     ROLES                  ...
      master01   Ready                      control-plane,master   ...
      master02   Ready                      control-plane,master   ...
      master03   Ready                      control-plane,master   ...
      worker01   Ready,SchedulingDisabled   worker                 ...  1
      worker02   Ready                      worker                 ...
      worker03   Ready                      worker                 ...

      1

      The compute node is drained for maintenance.

    6. Count the pods that are running on each compute node. The scheduler placed replacement pods for the evicted pods on the worker02 and worker03 compute nodes.

      [student@workstation scheduling-pdb]$ ./count-pods.sh
      NODE            PODS
      worker01        0  1
      worker02        3
      worker03        3

      1

      No pods are on this node, because it was just drained.

    7. Delete the nginx deployment.

      [student@workstation scheduling-pdb]$ oc delete deployment/nginx
      deployment.apps "nginx" deleted
    8. Uncordon the worker01 node that you drained previously to remove the SchedulingDisabled status.

      [student@workstation scheduling-pdb]$ oc adm uncordon node/worker01
      node/worker01 uncordoned
    9. List the cluster nodes and verify that all the compute nodes are marked as ready.

      [student@workstation scheduling-pdb]$ oc get nodes -L rack
      NAME      STATUS  ROLES                 AGE  VERSION    RACK
      master01  Ready   control-plane,master  8d   v1.27.6+…  rack-a
      master02  Ready   control-plane,master  8d   v1.27.6+…  rack-a
      master03  Ready   control-plane,master  8d   v1.27.6+…  rack-b
      worker01  Ready   worker                7d   v1.27.6+…  rack-b
      worker02  Ready   worker                7d   v1.27.6+…  rack-b
      worker03  Ready   worker                7d   v1.27.6+…  rack-a
  4. Create the nginx deployment with pod anti-affinity to spread the pods evenly across the compute nodes.

    1. Log in as the developer user.

      [student@workstation scheduling-pdb]$ oc login -u developer -p developer
      Login successful.
      
      ...output omitted...
    2. Edit the deployment-affinity.yaml file and set the affinity properties according to the following specification. Then, save and close the file.

      ...output omitted...
      spec:
        ...output omitted...
        template:
          ...output omitted...
          spec:
            ...output omitted...
            containers:
              ...output omitted...
            affinity:
              podAntiAffinity:
                preferredDuringSchedulingIgnoredDuringExecution:  1
                - weight: 100
                  podAffinityTerm:
                    topologyKey: rack  2
                    labelSelector:  3
                      matchExpressions:
                      - key: app
                        operator: In
                        values:
                        - nginx

      1

      The weighted pod affinity term is evaluated only during pod scheduling, on a best-effort basis.

      2

      The node label that indicates the failure domain for the nodes.

      3

      The label to select the pods that this affinity setting affects.

      Note

      The ~/DO380/solutions/scheduling-pdb/deployment-affinity.yaml file contains the correct configuration, and you can use it for comparison.
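
      The exercise uses the preferred (soft) anti-affinity rule because the cluster provides only two rack failure domains for six replicas. For comparison only, a required (hard) rule would look like the following sketch. It is not used in this exercise: with the rack topology key, it would allow at most one nginx pod per rack, and the remaining replicas would stay in the Pending state.

      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard constraint that the scheduler must satisfy
          - topologyKey: rack
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx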

    3. Create the application deployment resource by using the YAML manifest.

      [student@workstation scheduling-pdb]$ oc apply -f deployment-affinity.yaml
      deployment.apps/nginx created
    4. Switch to the second terminal window. Wait until all pods are running and verify that all the pods from the nginx deployment are marked as ready and available.

      This process might take a few minutes.

      Every 2.0s: oc get pdb,deployments,pods ... workstation: Wed Jan  3 16:24:11 2024
      
      NAME                    READY   UP-TO-DATE   AVAILABLE   AGE   ...
      deployment.apps/nginx   6/6     6            6           120s  ...
      
      NAME                       READY  STATUS   RESTARTS  AGE   IP   NODE      ...
      pod/nginx-d5b9c7498-5hbkw  1/1    Running  0         99s   ...  worker01  ...
      pod/nginx-d5b9c7498-nkx9j  1/1    Running  0         99s   ...  worker01  ...
      pod/nginx-d5b9c7498-g6ztb  1/1    Running  0         99s   ...  worker02  ...
      pod/nginx-d5b9c7498-bk7g6  1/1    Running  0         99s   ...  worker03  ...
      pod/nginx-d5b9c7498-djn8p  1/1    Running  0         99s   ...  worker03  ...
      pod/nginx-d5b9c7498-pz8g5  1/1    Running  0         99s   ...  worker03  ...
    5. Return to the first terminal window and count the pods that are running on each compute node.

      The pods are evenly distributed across the racks, because of the pod anti-affinity settings.

      • Three pods are running in the rack-b rack nodes.

      • Three pods are running in the rack-a rack nodes.

      [student@workstation scheduling-pdb]$ ./count-pods.sh
      NODE            PODS
      worker01        2  1
      worker02        1  2
      worker03        3  3

      1 2

      The worker01 and worker02 compute nodes are in the rack-b rack.

      3

      The worker03 compute node is in the rack-a rack.

  5. Create the pod disruption budget with the intended constraints.

    1. Edit the pod-disruption-budget.yaml file and set the minimum available percentage and the label selector according to the following specification. Then, save and close the file.

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: nginx
        labels:
          app: nginx
      spec:
        minAvailable: 80%
        selector:
          matchLabels:
            app: nginx

      Note

      The ~/DO380/solutions/scheduling-pdb/pod-disruption-budget.yaml file contains the correct configuration, and you can use it for comparison.

    2. Create the pod disruption budget by using the YAML manifest.

      [student@workstation scheduling-pdb]$ oc apply -f pod-disruption-budget.yaml
      poddisruptionbudget.policy/nginx created
    3. Verify that the nginx pod disruption budget was created, and that it has the intended minimum available attribute.

      [student@workstation scheduling-pdb]$ oc describe pdb nginx
      Name:           nginx
      Namespace:      scheduling-pdb
      Min available:  80%
      Selector:       app=nginx
      Status:
          Allowed disruptions:  1  1
          Current:              6
          Desired:              5
          Total:                6
      Events:                   <none>

      1

      Only one pod can be evicted at a time from a drained node.
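
      The Allowed disruptions value follows from the constraint: 80% of six replicas is 4.8, which rounds up to a desired minimum of five healthy pods, so only 6 - 5 = 1 pod can be voluntarily disrupted at a time. For a deployment that keeps six replicas, an equivalent budget could set maxUnavailable instead, as in the following sketch, which is not used in this exercise:

      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: nginx
      spec:
        maxUnavailable: 1          # with six replicas, equivalent to minAvailable: 80%
        selector:
          matchLabels:
            app: nginx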

  6. Drain a compute node to simulate a voluntary disruption.

    Important

    The selected node for draining must have at least two pods running.

    1. Log in again as the admin user.

      [student@workstation scheduling-pdb]$ oc login -u admin -p redhatocp
      Login successful.
      
      ...output omitted...
    2. Drain the worker03 node to simulate taking it offline for maintenance.

      This command might take a few minutes to complete. Leave it running and continue with the next step. You review the output of this command in a later step.

      [student@workstation scheduling-pdb]$ oc adm drain node/worker03 \
        --ignore-daemonsets --delete-emptydir-data
      
      ...output omitted...
    3. Switch to the second terminal window to view the eviction of the pods from the drained node. Wait until all pods are running on another node and are marked as ready. This process might take a few minutes.

      One pod is evicted at a time from the drained node, and the availability constraint is met. Use the values in the AGE column to determine which pods were evicted from the drained node and scheduled on a different node.

      Every 2.0s: oc get pdb,deployments,pods ... workstation: Wed Jan  3 16:28:12 2024
      
      NAME                              MIN AVAIL…  MAX UNAVAIL…  ALLOWED DISRUPTIONS
      poddisruptionbudget.policy/nginx  80%         N/A           1
      
      NAME                    READY   UP-TO-DATE   AVAILABLE   AGE   ...
      deployment.apps/nginx   5/6     6            5           30m   ...  1
      
      NAME                       READY  STATUS    RESTARTS  AGE  IP   NODE      ...
      pod/nginx-d5b9c7498-5hbkw  1/1    Running   0         5m   ...  worker01  ...
      pod/nginx-d5b9c7498-nkx9j  1/1    Running   0         5m   ...  worker01  ...
      pod/nginx-d5b9c7498-g6ztb  1/1    Running   0         5m   ...  worker02  ...
      pod/nginx-d5b9c7498-pz8g5  1/1    Running   0         5m   ...  worker03  ...  2
      pod/nginx-d5b9c7498-q6rcf  1/1    Running   0         50s  ...  worker01  ...  3
      pod/nginx-d5b9c7498-pxx86  0/1    Init:0/1  0         10s  ...  worker02  ...  4
      
      ^C

      1

      The pod eviction honors the pod disruption budget: the deployment keeps at least five available pods during the drain.

      2

      The pods on the drained node continue to run until the drain operation evicts them.

      3

      The replacement pods are scheduled on other compute nodes.

      4

      Only one pod is evicted at a time from the drained node.

      Press Ctrl+C and close the second terminal window when done.
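
      While the drain is in progress, you can also inspect the budget itself, for example with the oc get pdb nginx -o yaml command. While an evicted pod's replacement is not yet ready, the status section of the PodDisruptionBudget resource reports values similar to the following sketch, which explains why further evictions are held back:

      status:
        currentHealthy: 5        # five pods are currently ready
        desiredHealthy: 5        # minAvailable: 80% of six pods, rounded up
        disruptionsAllowed: 0    # no additional voluntary disruption is allowed right now
        expectedPods: 6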

    4. Return to the first terminal window and inspect the output of the oc adm drain command.

      From the pod eviction messages of the nginx pods, observe that one pod is evicted at a time from the drained node and the availability constraint is met.

      An eviction that would violate the pod disruption budget is blocked, and the oc adm drain command retries it after five seconds, until the budget allows another disruption.

      [student@workstation scheduling-pdb]$ oc adm drain node/worker03 \
        --ignore-daemonsets --delete-emptydir-data
      node/worker03 cordoned
      Warning: ignoring DaemonSet-managed Pods: ...output omitted...
      ...output omitted...
      I0103 16:27:16.659505   29741 request.go:696] Waited for … due to client-side throttling, not priority and fairness, request: ...output omitted...
      ...output omitted...
      evicting pod scheduling-pdb/nginx-d5b9c7498-bk7g6
      evicting pod scheduling-pdb/nginx-d5b9c7498-djn8p  1
      evicting pod scheduling-pdb/nginx-d5b9c7498-pz8g5
      ...output omitted...
      pod/nginx-d5b9c7498-bk7g6 evicted
      ...output omitted...
      error when evicting pods/"nginx-d5b9c7498-djn8p" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.  2
      error when evicting pods/"nginx-d5b9c7498-pz8g5" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      ...output omitted...
      pod/nginx-d5b9c7498-djn8p evicted  3
      evicting pod scheduling-pdb/nginx-d5b9c7498-pz8g5
      error when evicting pods/"nginx-d5b9c7498-pz8g5" -n "scheduling-pdb" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
      ...output omitted...
      pod/nginx-d5b9c7498-pz8g5 evicted
      node/worker03 drained

      1

      The pod is marked for eviction.

      2

      The pod eviction is blocked because it would violate the pod disruption budget, and is retried until the budget allows another disruption.

      3

      The pod is finally evicted from the drained node.

      Note

      You can safely ignore the warnings about managed pods and client-side throttling.
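
      The oc adm drain command removes the pods by creating eviction requests against each pod, and the pod disruption budget is enforced on these eviction requests; a direct pod deletion is not gated by the budget. An eviction request is a small object similar to the following sketch, which is posted to the eviction subresource of the pod:

      apiVersion: policy/v1
      kind: Eviction
      metadata:
        name: nginx-d5b9c7498-pz8g5     # one of the pods on the drained node
        namespace: scheduling-pdb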

    5. List the cluster nodes and verify that the worker03 node status is SchedulingDisabled.

      [student@workstation scheduling-pdb]$ oc get nodes
      NAME       STATUS                     ROLES                  AGE   VERSION
      master01   Ready                      control-plane,master   28d   v1.27.6+...
      master02   Ready                      control-plane,master   28d   v1.27.6+...
      master03   Ready                      control-plane,master   28d   v1.27.6+...
      worker01   Ready                      worker                 8d    v1.27.6+...
      worker02   Ready                      worker                 8d    v1.27.6+...
      worker03   Ready,SchedulingDisabled   worker                 8d    v1.27.6+...  1

      1

      The compute node is marked as not schedulable.

    6. Count the pods that are running on each compute node.

      [student@workstation scheduling-pdb]$ ./count-pods.sh
      NODE            PODS
      worker01        3  1
      worker02        3  2
      worker03        0  3

      1 2

      The pods are evenly distributed on the remaining nodes.

      3

      No pods are on this node, because it was just drained.

    7. Change to the home directory of the student user.

      [student@workstation scheduling-pdb]$ cd
      [student@workstation ~]$
  7. Optional: Clean up the resources that were used in this exercise.

    1. Delete the scheduling-pdb project.

      [student@workstation ~]$ oc delete project scheduling-pdb
      project.project.openshift.io "scheduling-pdb" deleted
    2. Uncordon all the compute nodes.

      [student@workstation ~]$ oc adm uncordon -l node-role.kubernetes.io/worker
      ...output omitted...
    3. Remove the rack label from all nodes.

      [student@workstation ~]$ oc label node --all rack-
      ...output omitted...

Finish

On the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish scheduling-pdb

Revision: do380-4.14-397a507