Guided Exercise: Surviving Node Failure with Virtual Machines

Configure a virtual machine to automatically fail over to another cluster node if the node that it runs on becomes unresponsive.

Outcomes

  • Monitor the connectivity to a web application from two VMs on a failed node.

  • Identify and drain the failed node that hosts the web application VMs.

  • Manually recover a VM from the failed node.

  • Adjust the eviction strategy of a VM.

  • Delete the node from the cluster.

  • Restart the node to rejoin the cluster.

As the student user on the workstation machine, use the lab command to prepare your environment for this exercise, and to ensure that all required resources are available.

[student@workstation ~]$ lab start ha-node

Instructions

The lab command creates the ha-node namespace and starts two virtual machines, web1 and web2, which host a web application.

This command also creates a service that load balances client requests between the two VMs, and a route resource for clients to access the web application.
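The www-web service is a standard Kubernetes Service that sits in front of the launcher pods of the two VMs. A minimal sketch of what such a manifest could look like (the selector label is an assumption; the lab's actual manifest may differ):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: www-web
  namespace: ha-node
spec:
  selector:
    app: web        # assumed label on both VM templates
  ports:
    - port: 80      # matches the endpoints listed later in this exercise
      targetPort: 80
```

Because the service selects by label rather than by node, it keeps routing traffic to whichever VMIs are running and ready, wherever they land in the cluster.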

  1. As the admin user, confirm that the two VMs are running in the ha-node project.

    1. Open a web browser and navigate to https://console-openshift-console.apps.ocp4.example.com. Select htpasswd_provider and log in as the admin user with redhatocp as the password.

    2. Navigate to Virtualization → VirtualMachines and then select the ha-node project. Confirm that the web1 and web2 VMs are running.

  2. Verify the eviction and run strategies of the web1 and web2 VMs.

    1. Select the web1 VM, and then navigate to the Configuration → Scheduling menu. Confirm that the eviction strategy is set to LiveMigrate.

    2. Navigate to the YAML tab to open the VM's manifest in the YAML editor. Within the YAML manifest, confirm that the .spec.runStrategy object is set to the RerunOnFailure run strategy.

      ...output omitted...
      spec:
      ...output omitted...
        runStrategy: RerunOnFailure
        template:
          metadata:
      ...output omitted...
    3. Navigate to Virtualization → VirtualMachines and select the web2 VM. Navigate to the Configuration → Scheduling menu. Confirm that the eviction strategy is set to None.

    4. Navigate to the YAML tab to open the VM's manifest in the YAML editor. Within the YAML manifest, confirm that the .spec.runStrategy object is set to the RerunOnFailure run strategy.

      ...output omitted...
      spec:
      ...output omitted...
        runStrategy: RerunOnFailure
        template:
          metadata:
      ...output omitted...
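Both settings that you just verified live in the VirtualMachine manifest. A trimmed sketch of where the fields sit (field paths per the KubeVirt VirtualMachine API; only the relevant fields are shown):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: web1
  namespace: ha-node
spec:
  # Restart the VMI whenever the guest terminates with an error.
  runStrategy: RerunOnFailure
  template:
    spec:
      # LiveMigrate moves the running VMI to another node on eviction;
      # None (as on web2) lets the VMI shut down instead.
      evictionStrategy: LiveMigrate
```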
  3. Confirm that the www-web service endpoints resolve to the IP addresses of the web1 and web2 VMIs. Identify and monitor the node that runs the web1 and web2 VMIs.

    1. From a command-line window, log in to your Red Hat OpenShift cluster as the admin user with redhatocp as the password.

      [student@workstation ~]$ oc login -u admin -p redhatocp \
        https://api.ocp4.example.com:6443
      Login Successful
      ...output omitted...
    2. Change to the ha-node project.

      [student@workstation ~]$ oc project ha-node
      Now using project "ha-node" on server "https://api.ocp4.example.com:6443".
    3. Use the oc command to list the VMI resources in the ha-node project. Note the IP addresses of the web1 and web2 VM instances, and the node that hosts the VMIs. The IP addresses might differ in your environment.

      [student@workstation ~]$ oc get vmi
      NAME   AGE   PHASE     IP           NODENAME   READY
      web1   18m   Running   10.11.0.24   worker01   True
      web2   17m   Running   10.11.0.29   worker01   True
    4. Confirm that the www-web service has active endpoints that resolve to the IP addresses of the web1 and web2 VMIs.

      [student@workstation ~]$ oc get endpoints
      NAME      ENDPOINTS                     AGE
      www-web   10.11.0.24:80,10.11.0.29:80   18m
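The ENDPOINTS column is the VMI IP addresses joined with the service port, so the two listings can be cross-checked mechanically. A sketch with standard shell tools, shown here against the sample output captured above (in the lab you would pipe the live oc command instead):

```shell
# Sample 'oc get vmi' output captured from this exercise.
vmi_output='NAME   AGE   PHASE     IP           NODENAME   READY
web1   18m   Running   10.11.0.24   worker01   True
web2   17m   Running   10.11.0.29   worker01   True'

# Print "name ip node" for each VMI, skipping the header row.
# prints: web1 10.11.0.24 worker01
#         web2 10.11.0.29 worker01
echo "$vmi_output" | awk 'NR > 1 { print $1, $4, $5 }'
```

Each IP printed here should appear in the www-web ENDPOINTS list with the :80 port suffix.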
    5. Open a command-line window and execute the loop.sh file in the ~/DO316/labs/ha-node/ directory. The loop.sh file executes the curl command against the web-ha-node.apps.ocp4.example.com route. Leave the command running.

      [student@workstation ~]$ sh /home/student/DO316/labs/ha-node/loop.sh
      Welcome to web1
      Welcome to web2
      Welcome to web1
      ...output omitted...
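The loop.sh script is a simple polling loop. A hypothetical reconstruction (the route URL comes from the instructions; the HTTP scheme, interval, and curl options are assumptions, and the real script in ~/DO316/labs/ha-node/ may differ), wrapped in a function so that it can be reused or stopped with Ctrl+C:

```shell
# Hypothetical sketch of what loop.sh likely does.
poll_route() {
  local url=${1:-http://web-ha-node.apps.ocp4.example.com}
  while true; do
    # Each response prints the welcome banner of whichever VM
    # the www-web service routed this request to.
    curl -s "$url"
    sleep 1   # assumed polling interval
  done
}
```

Because the service load balances across both VMIs, the banners alternate between web1 and web2 while both are healthy.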
    6. Open a command-line window and use the watch command to monitor the VMI's availability during this exercise. Leave the command running.

      [student@workstation ~]$ watch oc get vmi
      Every 2.0s: oc get vmi                    workstation.lab.example.com:...
      NAME   AGE   PHASE     IP           NODENAME   READY
      web1   24m   Running   10.11.0.24   worker01   True
      web2   22m   Running   10.11.0.29   worker01   True
  4. Developers report that resources on the node that runs the VMIs are experiencing performance and connectivity issues. These issues do not affect resources on other cluster nodes.

    As the cluster administrator, you suspect that the node is failing due to an incorrect configuration.

    Prevent new workloads from running on the node that runs the VMIs, and then drain the node of its current workloads. Manually recover and adjust the eviction strategy of a VM on the failed node.

    Then, power off the node and delete it from the cluster.

    1. On the workstation machine, open a command-line window and mark the worker01 node as not schedulable, with the oc adm cordon command.

      [student@workstation ~]$ oc adm cordon worker01
      node/worker01 cordoned
    2. Confirm that the node has the Ready,SchedulingDisabled status.

      [student@workstation ~]$ oc get node worker01
      NAME       STATUS                     ROLES    AGE     VERSION
      worker01   Ready,SchedulingDisabled   worker   4d16h   v1.27.10+28ed2d7
    3. Drain the node of its workloads.

      [student@workstation ~]$ oc adm drain worker01 --ignore-daemonsets=true \
        --delete-emptydir-data --force
      node/worker01 already cordoned
      ...output omitted...
      node/worker01 drained
    4. Monitor the command-line window that executes the loop.sh command. Observe that the web application remains available, with the web1 VMI now serving all requests.

      ...output omitted...
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      ...output omitted...
    5. Monitor the command-line window that executes the watch command. Notice that OpenShift sets the phase of the web2 VMI to Succeeded and its ready status to False, because OpenShift shut down the web2 VMI during the drain. Kubernetes cannot automatically relocate the web2 VM to a healthy node in the cluster, because the web2 VMI uses the None eviction strategy and you did not configure machine health checks.

      Every 2.0s: oc get vmi                    workstation.lab.example.com:...
      NAME   AGE   PHASE       IP           NODENAME   READY
      web1   36m   Running     10.8.2.55    worker02   True
      web2   34m   Succeeded   10.11.0.29   worker01   False
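The next step fixes web2 through the web console, but the same change can also be scripted. A sketch of a merge patch (hypothetical file name) that could be applied with `oc patch vm web2 --type=merge --patch-file patch-eviction.yaml`:

```yaml
# patch-eviction.yaml (hypothetical file name)
# Sets the eviction strategy of the web2 VMI template to LiveMigrate,
# the same change that the web console workflow makes.
spec:
  template:
    spec:
      evictionStrategy: LiveMigrate
```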
  5. Configure the web2 VM with the LiveMigrate eviction strategy, and then manually recover the VM from the failed node.

    1. From the web console, navigate to Virtualization → VirtualMachines. Select the web2 VM and click the Configuration menu.

    2. Navigate to the Scheduling section, and click None in the Eviction strategy subsection. Select LiveMigrate and click Save.

    3. Click Actions → Start to power on and reschedule the VM on another node in the cluster.

    4. Monitor the command-line window where the loop.sh command is running, and observe that the web1 VMI serves all the requests until the web2 VMI reaches the Running status.

      ...output omitted...
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web1
      Welcome to web2
      Welcome to web2
      Welcome to web1
      ...output omitted...
    5. Navigate to the command-line window where the watch command is running. The worker02 node hosts the web1 VMI, and the master02 node hosts the web2 VMI. Nodes might differ in your environment.

      Every 2.0s: oc get vmi                    workstation.lab.example.com:...
      NAME   AGE     PHASE     IP          NODENAME   READY
      web1   43m     Running   10.8.2.55   worker02   True
      web2   3m31s   Running   10.9.0.41   master02   True
  6. Delete the node from the cluster.

    1. To prevent potential data corruption, power off the drained node. From the Lab Environment page for this course, locate the drained host machine, click ACTION, and then click Power Off. Wait for the machine to display the Stopped status before proceeding.

    2. Return to the command-line window on the workstation machine where you initiated the node drain. Delete the drained node from the cluster.

      [student@workstation ~]$ oc delete node worker01
      node "worker01" deleted
    3. List the nodes to confirm that the deleted node is no longer available.

      [student@workstation ~]$ oc get nodes
      NAME       STATUS   ROLES                         AGE     VERSION
      master01   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master02   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master03   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      worker02   Ready    worker                        4d16h   v1.27.10+28ed2d7
  7. Instruct the deleted node to rejoin the cluster.

    1. From the Lab Environment page for this course, locate the deleted host machine, click ACTION, and then click Start. Wait for the machine to display the Active status before proceeding.

    2. From the command-line window on the workstation machine, list the nodes to confirm that the deleted node rejoined the cluster. It might take a few minutes for the node to display the Ready status.

      [student@workstation ~]$ oc get nodes
      NAME       STATUS   ROLES                         AGE     VERSION
      master01   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master02   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      master03   Ready    control-plane,master,worker   14d     v1.27.10+28ed2d7
      worker01   Ready    worker                        3m7s    v1.27.10+28ed2d7
      worker02   Ready    worker                        4d16h   v1.27.10+28ed2d7
    3. In the command-line window that is executing the loop.sh command, press Ctrl+C to stop the command. Close the command-line window.

    4. In the command-line window that is executing the watch command, press Ctrl+C to stop the command. Close the command-line window.

Finish

On the workstation machine, use the lab command to complete this exercise. This step is important to ensure that resources from previous exercises do not impact upcoming exercises.

[student@workstation ~]$ lab finish ha-node

Revision: do316-4.14-d8a6b80