Configure Kubernetes resources to help VMs fail over to another cluster node when node failure is detected.
OpenShift schedules VMs to healthy nodes according to the node placement rules and to the presence or absence of scheduler profiles. The OpenShift built-in scheduler identifies the most suitable node for the VMs when you create them. The default scheduler meets the needs of most OpenShift users.
However, in some situations, OpenShift administrators might want more control over which node is used for VM placement. You can use the OpenShift advanced scheduling features to configure VMs to run on a particular node or alongside a specific VM.
OpenShift includes the following advanced scheduling features:
OpenShift schedules VMs on nodes that contain the label with the key-value pairs that are specified in the .spec.nodeSelector field of the VM's manifest.
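The following fragment is a minimal sketch of a node selector, placed in the VMI template section of a VM manifest; the example-zone: us-east-1 key-value pair is a hypothetical label that you would first apply to the target nodes:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  template:
    spec:
      # Schedule the VM only on nodes that carry this label.
      nodeSelector:
        example-zone: us-east-1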
OpenShift supports pod affinity, pod anti-affinity, and node affinity rules for VM scheduling. Affinity rules use more detailed syntax to schedule VMs on nodes, such as to specify whether a rule is a hard requirement or a preference. For example, if you specify that a rule is a hard requirement, then OpenShift does not schedule the VM if the rule is not satisfied. If a rule is a preference, then OpenShift still schedules the VM on the node even if the constraints of the rule are not met.
OpenShift schedules VMs with node affinity rules on nodes that contain the specified key-value labels in the .spec.affinity.nodeAffinity field of the VM's manifest.
If a node does not contain the specified key-value labels, then OpenShift does not schedule the VM to that node.
With node affinity, the key-value labels of other pods do not affect the scheduling of VMs on a node.
In contrast to node affinity, pod affinity and anti-affinity rules rely on the key-value labels of other pods to determine VM scheduling.
For example, you can create a pod anti-affinity rule to prevent scheduling a VM on a node if any pods on that node contain a specified key-value label.
Pod affinity rules work for VMs because the VirtualMachine workload type is based on the Pod object.
Affinity rules apply only during scheduling. If a running VM workload no longer meets a constraint, then OpenShift does not reschedule the VM workload to other nodes.
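As a sketch of the affinity syntax, the following fragment combines a preferred node affinity rule with a required pod anti-affinity rule in the VMI template section of a VM manifest; the label keys and values are hypothetical:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # Preference: favor nodes with the example-zone=zone-a label.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: example-zone
                operator: In
                values:
                - zone-a
        podAntiAffinity:
          # Hard requirement: avoid nodes that already run pods with the example-app=database label.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: example-app
                operator: In
                values:
                - database
            topologyKey: kubernetes.io/hostname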
If a node has a taint, then OpenShift schedules a VM to that node only if the VM has a toleration that matches the taint. Otherwise, OpenShift does not schedule the workload on that node.
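For example, assuming a node that carries a hypothetical virtualization=true:NoSchedule taint, the VM needs a matching toleration in its VMI template; a minimal sketch:

spec:
  template:
    spec:
      tolerations:
      # Tolerate the hypothetical virtualization=true:NoSchedule taint.
      - key: virtualization
        operator: Equal
        value: "true"
        effect: NoSchedule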
The OpenShift scheduler profile controls how OpenShift schedules pods onto nodes. The following scheduler profiles are available:
LowNodeUtilization: Attempts to spread pods evenly across nodes for low resource usage per node.
HighNodeUtilization: Attempts to place as many pods onto as few nodes as possible.
This approach minimizes the node count and creates high resource usage per node.
NoScoring: A low latency profile that strives for the quickest scheduling cycle by disabling all score plug-ins.
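A profile is typically enabled by setting the spec.profile field of the cluster Scheduler resource; the following lines are a minimal sketch that selects the HighNodeUtilization profile:

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  # Pack pods onto as few nodes as possible.
  profile: HighNodeUtilization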
Nodes fail for different reasons, such as insufficient disk space or package issues. If a node enters a failing state, whether the node is a control plane or a compute node, then eviction strategies determine the appropriate actions to take with orphaned resources, such as pods, persistent volumes (PVs), and virtual machines (VMs).
When a worker node enters the NotReady state after failing health checks, OpenShift reschedules the remaining pods on the node to healthy nodes with the Ready status.
Eviction strategies determine whether OpenShift moves the VMs on the failed node to another node or terminates them. You can configure the following eviction strategies for VMs:
LiveMigrate: OpenShift live migrates the VM to another node so that the VM is not interrupted when you place the node into maintenance or when you drain the node.
This eviction strategy is the default.
Non-migratable VMs with this eviction strategy might prevent nodes from draining or might block a node upgrade, because OpenShift does not evict the VM from the node and you must manually shut down the VM.
LiveMigrateIfPossible: OpenShift live migrates the VM when possible, and terminates non-migratable VMs during evictions.
None: OpenShift does not migrate the VM, and it restarts or terminates the VM depending on the run strategy.
VMs with a live migration strategy must have a persistent volume claim (PVC) with a shared ReadWriteMany (RWX) access mode.
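A per-VM eviction strategy is declared in the VMI template of the VM manifest; a minimal sketch, assuming the evictionStrategy field:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  template:
    spec:
      # Live migrate when possible; otherwise terminate during evictions.
      evictionStrategy: LiveMigrateIfPossible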
You can also configure an eviction strategy at the cluster level.
Configuring a cluster eviction strategy is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements and might not be functionally complete. Red Hat does not recommend using Technology Preview features in production.
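A cluster-wide default is typically set in the HyperConverged resource of OpenShift Virtualization; the following lines are a sketch only, assuming the spec.evictionStrategy field of the kubevirt-hyperconverged resource in the openshift-cnv namespace:

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  # Default eviction strategy for VMs that do not set their own.
  evictionStrategy: LiveMigrate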
The virt-handler DaemonSet communicates with the libvirtd instance to define VMs.
VMs are scheduled only on nodes from which the control plane received a virt-handler heartbeat.
It might take up to five minutes for the virt-handler DaemonSet and Kubernetes to detect the failure.
After that time, the control plane nodes mark the node as unschedulable, and then migrate the node's workloads according to the resource's node placement rules and the scheduling profile.
If the virt-handler DaemonSet loses the connection to the cluster's API server, then the node cannot communicate its status.
The node enters a failed state, and the remaining VMs cannot migrate to the healthy nodes.
By default, Red Hat OpenShift Container Platform (RHOCP) includes a set of monitoring tools that periodically read the status of the cluster services and of the created resources, so that you can monitor your VM instances. You can monitor the state of your VMs by using health checks and event logs.
Monitoring tools are updated during OpenShift cluster updates.
You can configure a watchdog device to verify the state of the VM's guest OS, and act according to the run strategy. The VM's watchdog service monitors only for guest OS failures, and it does not detect application failures.
You can configure the following actions for a watchdog device to take when the guest OS is unresponsive:
poweroff: The VM powers off immediately. If the spec.running Boolean is set to true, or if the spec.runStrategy parameter is not set to manual, then the VM reboots.
reset: The VM reboots in place, and the guest OS cannot react.
shutdown: The VM gracefully powers off by stopping all services.
The reset option can cause liveness probes to time out due to the reboot, and then OpenShift marks the VM's pod as failed. Depending on the run strategy, a VM can be deleted or restarted on the same node or on another node.
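For reference, a watchdog device is declared in the devices section of the VM's domain; a minimal sketch, assuming the emulated i6300esb watchdog model and the poweroff action:

spec:
  template:
    spec:
      domain:
        devices:
          watchdog:
            name: example-watchdog
            # Power off the VM when the guest OS stops responding.
            i6300esb:
              action: poweroff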
Run strategies and the installation of the watchdog device are explained in more detail elsewhere in this course.
You can monitor the status of a VM's application and take the necessary actions with health checks. A health check periodically verifies the availability of a VM's application by using readiness and liveness probes. You can use either a readiness probe or a liveness probe, or both at the same time.
Readiness probes determine whether the application is ready to serve requests. If a readiness probe fails, then Kubernetes prevents client traffic from reaching the application by removing the VM's IP address from the service resource.
Liveness probes determine whether the application is in a healthy state. If a liveness probe detects an unhealthy state, then OpenShift Virtualization deletes the VMI resource and redeploys a new instance.
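As a sketch, the following fragment adds an HTTP readiness probe and a TCP liveness probe to the VMI template of a VM; the port and timing values are placeholders:

spec:
  template:
    spec:
      # Remove the VM from service endpoints while the application is not ready.
      readinessProbe:
        httpGet:
          port: 1500
        initialDelaySeconds: 120
        periodSeconds: 20
        failureThreshold: 3
      # Delete and redeploy the VMI when the application stops responding.
      livenessProbe:
        tcpSocket:
          port: 1500
        initialDelaySeconds: 120
        periodSeconds: 20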
A watchdog device detects and restarts unresponsive operating systems, but does not directly detect application failures. For example, a watchdog does not trigger when a VM's application is unresponsive but the OS is still responding.
Do not use watchdogs for applications that you can monitor by using liveness probes. A liveness probe indirectly detects an unresponsive OS, because all the applications that are running in an unresponsive OS also become unresponsive. Use watchdog monitoring only when you have no other way to test your application.
The liveness, readiness, and watchdog device health checks are explained in greater detail elsewhere in this course.
Machine health checks automatically remediate an unhealthy machine, which is the host for a node, if the machine exists in a particular machine pool.
You can use machine health checks to monitor the health of a host by creating a resource that defines the condition to verify, the label for the set of hosts to monitor, and the remediation process to use.
A bare metal machine controller that observes a MachineHealthCheck resource checks for the defined condition.
If the condition is detected, then the controller initiates the remediation process.
Machine health checks are available only for clusters that are installed as bare-metal installer-provisioned infrastructure (IPI).
Depending on the condition, OpenShift removes or reboots the host.
If a host fails the health check, then the bare metal machine controller drains the remaining VMs on the node and reschedules the workloads to healthy nodes that the machine set owns. This rescheduling action follows the node placement rules, the scheduling profile, and the eviction strategies that are configured for the VMs. After OpenShift drains all the VMs and the node registers itself again to the cluster, the bare metal machine controller restores the annotations and labels from the unhealthy node to the new node.
To limit the disruptive impact of the host deletion, the controller drains and deletes one node at a time.
If the number of unhealthy hosts exceeds the maxUnhealthy threshold that is specified for the targeted pool of hosts, then remediation stops and requires manual intervention.
Consider some limitations when using machine health checks:
A machine health check can remediate only hosts that a machine set owns.
If you remove the node for a host from the cluster, then the machine health check identifies that the host is unhealthy and remediates it immediately.
A machine health check remediates a host immediately if the Machine resource enters the Failed status.
If the node for a machine does not join the cluster after the configured nodeStartupTimeout parameter, then the machine health check remediates the machine.
For cloud environments, a machine health check relies on cloud provider integration for the machine to forcibly reboot, reprovision, and rejoin the cluster.
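The following MachineHealthCheck sketch ties together the elements from this section: a selector for the monitored machine pool, the unhealthy conditions to detect, the maxUnhealthy threshold, and the nodeStartupTimeout parameter. The label values and the timeouts are placeholders:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-health-check
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      # Placeholder labels that select the monitored machine pool.
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  # Stop remediation when too many machines are unhealthy at the same time.
  maxUnhealthy: 40%
  # Remediate machines whose node does not join the cluster within this time.
  nodeStartupTimeout: 10m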
Instead of reprovisioning the nodes, power-based remediation uses a power controller to power off an inoperable node.
This type of remediation is also called power fencing.
OpenShift uses the MachineHealthCheck controller to detect faulty bare metal nodes.
Power-based remediation is fast and reboots faulty nodes instead of removing them from the cluster.
Power-based remediation provides the following capabilities:
Enables the recovery of control plane nodes.
Reduces the risk of data loss in hyperconverged environments.
Reduces the downtime that is associated with recovering physical hosts.
If the power operations do not complete, then the bare metal machine controller triggers reprovisioning the unhealthy node, except for a control plane node or an externally provisioned node.
If the node has a Baseboard Management Controller (BMC), then BMC credentials and network access to the BMC interface on the node are required.
Available in non-IPI bare metal clusters, the Self Node Remediation Operator runs on the cluster nodes and reboots unhealthy nodes.
This remediation strategy minimizes downtime for stateful applications and ReadWriteOnce (RWO) volumes, and restores compute capacity if transient failures occur.
The Self Node Remediation Operator creates a SelfNodeRemediationConfig CR in the operator's namespace.
You can edit the SelfNodeRemediationConfig CR to configure parameters such as the time that the Operator waits before recovering affected workloads that run on an unhealthy node, the file path of the watchdog device on the nodes, or the frequency of verifying connectivity with each API server.
If the Self Node Remediation Operator detects a change in the SelfNodeRemediationConfig CR, then it re-creates the Self Node Remediation DaemonSet.
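The following sketch shows the shape of such an edit. The API group and the field names are assumptions that are based on the Operator's upstream project, and the values and the namespace are placeholders:

apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config
  namespace: openshift-operators
spec:
  # File path of the watchdog device on the nodes.
  watchdogFilePath: /dev/watchdog
  # How often each node verifies connectivity with the API server.
  apiCheckInterval: 15s
  # Time to wait before recovering workloads from an unhealthy node.
  safeTimeToAssumeNodeRebootedSeconds: 180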
The operator uses the MachineHealthCheck controller to detect the health of a node in the cluster.
If an unhealthy node is detected according to the SelfNodeRemediationConfig CR, then the MachineHealthCheck controller creates the SelfNodeRemediation CR to trigger the Self Node Remediation Operator.
Then, the operator uses the SelfNodeRemediationTemplate custom resource definition (CRD) to define the remediation strategy for the unhealthy nodes.
You can install the Self Node Remediation Operator from the command line by creating a subscription in the openshift-operators namespace, or by using the OperatorHub menu in the web console.
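A command-line installation creates a Subscription similar to the following sketch; the channel, package, and catalog source names are assumptions that you must verify against the catalogs in your cluster:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: self-node-remediation-operator
  namespace: openshift-operators
spec:
  channel: stable
  installPlanApproval: Automatic
  name: self-node-remediation
  source: redhat-operators
  sourceNamespace: openshift-marketplace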
For more information about the Self Node Remediation Operator, refer to the Using Self Node Remediation chapter in the Workload Availability for Red Hat OpenShift 24.1 guide at https://access.redhat.com/documentation/en-us/workload_availability_for_red_hat_openshift/24.1/html-single/remediation_fencing_and_maintenance/index#self-node-remediation-operator-remediate-nodes
If a node fails, and you do not deploy machine health checks and the Self Node Remediation Operator on the cluster, then OpenShift does not automatically relocate VMs that are configured with the RunStrategy: Always parameter to healthy nodes.
Instead, you must manually remove the node by deleting the node object.
However, if you installed the cluster with IPI and correctly configured machine health checks, then OpenShift automatically recycles the failed nodes, and schedules the VMs with the RunStrategy: Always or the RunStrategy: RerunOnFailure parameter on healthy nodes.
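For reference, the run strategy is set at the top level of the VM spec; a minimal sketch:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  # Do not combine spec.runStrategy with the spec.running Boolean.
  runStrategy: Always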
Deleting a node deletes the node object in Kubernetes, but Kubernetes does not delete the pods on the node. If a replica set does not back the pods on the node, then the pods become inaccessible. OpenShift reschedules only the pods that are backed by replica sets to other available nodes.
You can delete a node from your cluster by completing the following steps:
Mark the node as not schedulable:
[user@host ~]$ oc adm cordon worker01
node/worker01 cordoned
Drain all pods and VMs on the node:
[user@host ~]$ oc adm drain worker01 --ignore-daemonsets=true \
--delete-emptydir-data --force
node "worker01" drained
This step might fail if the node is offline or unresponsive. To avoid data corruption, power off the physical hardware before you proceed.
Delete the node from the cluster:
[user@host ~]$ oc delete node worker01
node "worker01" deleted
The node can still rejoin the cluster after a reboot or if the kubelet service is restarted. To permanently delete the node and all its data, you must decommission the node. For more information about decommissioning a node, refer to the How to Destroy All the Data from Server for Decommission solution at https://access.redhat.com/solutions/84663
Verify the status of the VMs that are rescheduled on healthy nodes:
[user@host ~]$ oc get vmis --all-namespaces
NAMESPACE   NAME          AGE    PHASE     IP           NODENAME   READY
vm-qa       fedora-test   4h6m   Running   10.9.0.47    worker02   True
For more information, refer to the Working with Nodes chapter in the Red Hat OpenShift Container Platform 4.14 Nodes documentation at https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/html-single/nodes/index#working-with-nodes