Guided Exercise: Surviving Node Failure with Virtual Machines

Configure a virtual machine to automatically fail over to another cluster node if the node that it runs on becomes unresponsive.

Outcomes

  • Monitor the connectivity to a web application from two VMs on a failed node.

  • Identify and drain the failed node that hosts the web application VMs.

  • Manually recover a VM from the failed node.

  • Adjust the eviction strategy of a VM.

  • Delete the node from the cluster.

  • Restart the node to rejoin the cluster.

As the student user on the workstation machine, use the lab command to prepare your environment for this exercise, and to ensure that all required resources are available.

[student@workstation ~]$ lab start ha-node

Instructions

The lab command creates the ha-node namespace and starts two virtual machines, web1 and web2, which host a web application.

This command also creates a service that load balances client requests between the two VMs, and a route resource for clients to access the web application.
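The www-web service is a standard Kubernetes Service that sits in front of the launcher pods of the two VMs. A minimal sketch of what such a manifest could look like (the selector label is an assumption; the lab's actual manifest may differ):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: www-web
  namespace: ha-node
spec:
  selector:
    app: web        # assumed label on both VM templates
  ports:
    - port: 80      # matches the endpoints listed later in this exercise
      targetPort: 80
```

Because the service selects by label rather than by node, it keeps routing traffic to whichever VMIs are running and ready, wherever they land in the cluster.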

  1. As the admin user, confirm that the two VMs are running in the ha-node project.

    1. Open a web browser and navigate to https://console-openshift-console.apps.ocp4.example.com. Select htpasswd_provider and log in as the admin user with redhatocp as the password.

    2. Navigate to Virtualization → VirtualMachines and then select the ha-node project. Confirm that the web1 and web2 VMs are running.

  2. Verify the eviction and run strategies of the web1 and web2 VMs.

    1. Select the web1 VM, and then navigate to the Configuration → Scheduling menu. Confirm that the eviction strategy is set to LiveMigrate.

    2. Navigate to the YAML tab to open the VM's manifest in the YAML editor. Within the YAML manifest, confirm that the .spec.runStrategy object is set to the RerunOnFailure run strategy.

      ...output omitted...
      spec:
      ...output omitted...
        runStrategy: RerunOnFailure
        template:
          metadata:
      ...output omitted...
    3. Navigate to Virtualization → VirtualMachines and select the web2 VM. Navigate to the Configuration → Scheduling menu. Confirm that the eviction strategy is set to None.

    4. Navigate to the YAML tab to open the VM's manifest in the YAML editor. Within the YAML manifest, confirm that the .spec.runStrategy object is set to the RerunOnFailure run strategy.

      ...output omitted...
      spec:
      ...output omitted...
        runStrategy: RerunOnFailure
        template:
          metadata:
      ...output omitted...
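Both settings that you just verified live in the VirtualMachine manifest. A trimmed sketch of where the fields sit (field paths per the KubeVirt VirtualMachine API; only the relevant fields are shown):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: web1
  namespace: ha-node
spec:
  # Restart the VMI whenever the guest terminates with an error.
  runStrategy: RerunOnFailure
  template:
    spec:
      # LiveMigrate moves the running VMI to another node on eviction;
      # None (as on web2) lets the VMI shut down instead.
      evictionStrategy: LiveMigrate
```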
  3. Confirm that the www-web service endpoints resolve to the IP addresses of the web1 and web2 VMIs. Identify and monitor the node that runs the web1 and web2 VMIs.

    1. From a command-line window, log in to your Red Hat OpenShift cluster as the admin user with redhatocp as the password.

      [student@workstation ~]$ oc login -u admin -p redhatocp \
        https://api.ocp4.example.com:6443
      Login Successful
      ...output omitted...
    2. Change to the ha-node project.

      [student@workstation ~]$ oc project ha-node
      Now using project "ha-node" on server "https://api.ocp4.example.com:6443".
    3. Use the oc command to list the VMI resources in the ha-node project. Note the IP addresses of the web1 and web2 VM instances, and the node that hosts the VMIs. The IP addresses might differ in your environment.

      [student@workstation ~]$ oc get vmi
      NAME   AGE   PHASE     IP           NODENAME   READY
      web1   18m   Running   10.11.0.24   worker01   True
      web2   17m   Running   10.11.0.29   worker01   True
    4. Confirm that the www-web service has active endpoints that resolve to the IP addresses of the web1 and web2 VMIs.

      [student@workstation ~]$ oc get endpoints
      NAME      ENDPOINTS                     AGE
      www-web   10.11.0.24:80,10.11.0.29:80   18m
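The ENDPOINTS column is the VMI IP addresses joined with the service port, so the two listings can be cross-checked mechanically. A sketch with standard shell tools, shown here against the sample output captured above (in the lab you would pipe the live oc command instead):

```shell
# Sample 'oc get vmi' output captured from this exercise.
vmi_output='NAME   AGE   PHASE     IP           NODENAME   READY
web1   18m   Running   10.11.0.24   worker01   True
web2   17m   Running   10.11.0.29   worker01   True'

# Print "name ip node" for each VMI, skipping the header row.
# prints: web1 10.11.0.24 worker01
#         web2 10.11.0.29 worker01
echo "$vmi_output" | awk 'NR > 1 { print $1, $4, $5 }'
```

Each IP printed here should appear in the www-web ENDPOINTS list with the :80 port suffix.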
    5. Open a command-line window and execute the loop.sh file in the ~/DO316/labs/ha-node/ directory. The loop.sh file executes the curl command against the web-ha-node.apps.ocp4.example.com route. Leave the command running.

      [student@workstation ~]$ sh /home/student/DO316/labs/ha-node/loop.sh
      Welcome to web1
      Welcome to web2
      Welcome to web1
      ...output omitted...
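The loop.sh script is a simple polling loop. A hypothetical reconstruction (the route URL comes from the instructions; the HTTP scheme, interval, and curl options are assumptions, and the real script in ~/DO316/labs/ha-node/ may differ), wrapped in a function so that it can be reused or stopped with Ctrl+C:

```shell
# Hypothetical sketch of what loop.sh likely does.
poll_route() {
  local url=${1:-http://web-ha-node.apps.ocp4.example.com}
  while true; do
    # Each response prints the welcome banner of whichever VM
    # the www-web service routed this request to.
    curl -s "$url"
    sleep 1   # assumed polling interval
  done
}
```

Because the service load balances across both VMIs, the banners alternate between web1 and web2 while both are healthy.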
    6. Open a command-line window and use the watch command to monitor the VMI's availability during this exercise. Leave the command running.

      [student@workstation ~]$ watch oc get vmi
      Every 2.0s: oc get vmi                    workstation.lab.example.com:...
      NAME   AGE   PHASE     IP           NODENAME   READY
      web1   24m   Running   10.11.0.24   worker01   True
      web2   22m   Running   10.11.0.29   worker01   True
  4. Developers report that resources on the node that runs the VMIs are experiencing performance and connectivity issues. These issues do not affect resources on other cluster nodes.

    As the cluster administrator, you suspect that the node is failing due to an incorrect configuration.

    Prevent new workloads from running on the node that runs the VMIs, and then drain the node of its current workloads. Manually recover and adjust the eviction strategy of a VM on the failed node.

    Then, power off the node and delete it from the cluster.

    1. On the workstation machine, open a command-line window and mark the worker01 node as not schedulable, with the oc adm cordon command.

      [student@workstation ~]$ oc adm cordon worker01
      node/worker01 cordoned
    2. Confirm that the node has the Ready,SchedulingDisabled status.

      [student@workstation ~]$ oc get node worker01
      NAME       STATUS                     ROLES    AGE     VERSION
      worker01   Ready,SchedulingDisabled   worker   4d16h   v1.27.10+28ed2d7
    3. Drain the node of its workloads.

      [student@workstation ~]$ oc adm drain worker01 --ignore-daemonsets=true \
        --delete-emptydir-data --force
      node/worker01 already cordoned
      ...output omitted...
      node/worker01 drained
    4. Monitor the command-line window that executes the loop.sh command. Observe that the web application remains available, with the web1 VMI now serving all requests.

      ...output omitted...
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      ...output omitted...
    5. Monitor the command-line window that executes the watch command. Notice that OpenShift sets the phase of the web2 VMI to Succeeded and its ready status to False, because OpenShift shut down the web2 VMI during the drain. Kubernetes cannot automatically relocate the web2 VM to a healthy node in the cluster, because the web2 VMI uses the None eviction strategy and you did not configure machine health checks.

      Every 2.0s: oc get vmi                    workstation.lab.example.com:...
      NAME   AGE   PHASE       IP           NODENAME   READY
      web1   36m   Running     10.8.2.55    worker02   True
      web2   34m   Succeeded   10.11.0.29   worker01   False
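The next step fixes web2 through the web console, but the same change can also be scripted. A sketch of a merge patch (hypothetical file name) that could be applied with `oc patch vm web2 --type=merge --patch-file patch-eviction.yaml`:

```yaml
# patch-eviction.yaml (hypothetical file name)
# Sets the eviction strategy of the web2 VMI template to LiveMigrate,
# the same change that the web console workflow makes.
spec:
  template:
    spec:
      evictionStrategy: LiveMigrate
```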
  5. Configure the web2 VM with the LiveMigrate eviction strategy, and then manually recover the VM from the failed node.

    1. From the web console, navigate to Virtualization → VirtualMachines. Select the web2 VM and click the Configuration menu.

    2. Navigate to the Scheduling section, and click None in the Eviction strategy subsection. Select LiveMigrate and click Save.

    3. Click Actions → Start to power on and reschedule the VM on another node in the cluster.

    4. Monitor the command-line window where the loop.sh command is running, and observe that the web1 VMI serves all the requests until the web2 VMI reaches the Running status.

      ...output omitted...
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web2
      Welcome to web2
      Welcome to web1
      ...output omitted...
    5. Navigate to the command-line window where the watch command is running. The worker02 node hosts the web1 VMI, and the master02 node hosts the web2 VMI. Nodes might differ in your environment.

      Every 2.0s: oc get vmi                    workstation.lab.example.com:...
      NAME   AGE     PHASE     IP          NODENAME   READY
      web1   43m     Running   10.8.2.55   worker02   True
      web2   3m31s   Running   10.9.0.41   master02   True
  6. Delete the node from the cluster.

    1. To prevent potential data corruption, power off the drained node. From the Lab Environment page for this course, locate the drained host machine, click ACTION, and then click Power Off. Wait for the machine to display the Stopped status before proceeding.

    2. Return to the command-line window on the workstation machine where you initiated the node drain. Delete the drained node from the cluster.

      [student@workstation ~]$ oc delete node worker01
      node "worker01" deleted
    3. List the nodes to confirm that the deleted node is no longer available.

      [student@workstation ~]$ oc get nodes
      NAME       STATUS   ROLES                         AGE     VERSION
      master01   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master02   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master03   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      worker02   Ready    worker                        4d16h   v1.27.10+28ed2d7
  7. Instruct the deleted node to rejoin the cluster.

    1. From the Lab Environment page for this course, locate the deleted host machine, click ACTION, and then click Start. Wait for the machine to display the Active status before proceeding.

    2. From the command-line window on the workstation machine, list the nodes to confirm that the deleted node rejoined the cluster. It might take a few minutes for the node to display the Ready status.

      [student@workstation ~]$ oc get nodes
      NAME       STATUS   ROLES                         AGE     VERSION
      master01   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master02   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master03   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      worker01   Ready    worker                        3m7s    v1.27.10+28ed2d7
      worker02   Ready    worker                        4d16h   v1.27.10+28ed2d7
    3. In the command-line window that is executing the loop.sh command, press Ctrl+C to stop the command. Close the command-line window.

    4. In the command-line window that is executing the watch command, press Ctrl+C to stop the command. Close the command-line window.

Finish

On the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish ha-node

Revision: do316-4.14-d8a6b80