Testing System Integrity with Fencing

Objectives

After completing this section, you should be able to describe the role of fencing in the takeover process and how it maintains data integrity and high availability.

What Is Fencing?

When a node in the cluster fails, it can lead to a loss of data integrity in the cluster, because the failed node might still have access to the cluster resources. For this reason, an external method must restrict the unresponsive node's access to the cluster resources. This method is called fencing. The Red Hat Enterprise Linux High Availability Add-On uses fencing to ensure data integrity in the cluster.

Powering off the node often accomplishes fencing, because a dead node cannot do anything. In other cases, fencing uses a combination of operations to cut off the node from the network (to stop new work from arriving), or from storage (to stop the node from writing to shared storage). Fencing is a necessary step in service and resource recovery in a cluster. The Red Hat Enterprise Linux High Availability Add-On does not start resource and service recovery for an unresponsive node until the cluster has fenced that node. For fencing to be correctly configured in your cluster, every node in the cluster must be able to fence every other node.

Cluster Operation without Fencing Is Not Supported

Without fencing, it is not possible to guarantee data integrity on shared storage resources or high availability of the cluster. Sometimes it is necessary to prevent a response from a cluster node that still assumes it is providing SAP HANA services. In that case, only one node can be allowed to respond, and therefore the cluster must fence the other node.

Important

To avoid such a scenario, fencing is a requirement. It must be configured and physically tested. See https://access.redhat.com/solutions/18803

Cluster Operation with Fencing

You must ensure that the unresponsive node no longer has access to the SAP HANA database instance, the applications, and the file systems, as applicable, before another node attempts to take them over. This procedure is called fencing. Fencing stops node 1 from continuing to act as primary and prevents any access to the SAP HANA database, the applications, or the file systems, if applicable. It happens before node 2 takes over the resources, to prevent a dual-primary (split-brain) situation or data corruption. Fencing ensures data integrity.

Important

Red Hat does not provide support for clusters without a configured STONITH (Shoot The Other Node In The Head) device for every member. The expected behavior of such a cluster is undefined. Clusters without STONITH for every node might behave in problematic ways, and cannot be relied on for high availability of critical services or applications.

Fencing Mechanism Overview

Fencing uses one of two methods: power fencing and storage fencing. Both methods require a fence device, such as a power switch or the virtual fencing daemon, and fencing agent software that enables communication between the cluster and the fencing device. When the cluster needs to fence a node, it delegates the operation to the fence agent. In turn, the fence agent contacts the fence device to perform the action. In environments where SAP HANA databases are managed within the cluster, only the power fencing method is recommended, because SAP HANA databases run in memory, so storage fencing alone would not be effective.

Describing Power Fencing

Power fencing entails cutting off power to a server.

Two types of power fencing devices exist:

  • External fencing hardware that cuts off the power, such as a network-controlled power strip.

  • Internal fencing hardware, such as iLO, DRAC, IPMI, or virtual machine fencing, which powers off the hardware of the node.

Power fencing can be configured to turn off the target machine and keep it off, or to turn it off and then on again. Turning a machine back on has the added benefit that it should come back up cleanly and rejoin the cluster if the cluster services are enabled.
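
As a sketch, whether fenced nodes are power-cycled or left powered off is controlled by the Pacemaker stonith-action cluster property; the following command assumes you want fenced nodes to stay off instead of rebooting:

[root@node ~]# pcs property set stonith-action=off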

The following graphic shows an example of power fencing with a network-controlled power controller and two power supplies in a server.

Figure 6.1: Power fencing with two power supplies

Warning

When using power fencing, if the machine has more than one power supply, then all power supplies must be turned off before any of them is turned on again. Otherwise, the fenced machine never loses power and is never actually turned off. The machine would then appear to be fenced without actually being fenced, which risks data loss.

Describing Storage Fencing

Storage fencing entails disconnecting a machine from storage at the storage level. It can be done by closing ports on a Fibre Channel switch, or by using SCSI reservations. If a machine is fenced only with storage fencing, without combining it with power fencing, the administrator must ensure that it joins the cluster again. This is usually done by rebooting or power-cycling the failed node.

The following graphic shows an example of storage fencing by using multipathed Fibre Channel storage.

Figure 6.2: Storage fencing with multipathed Fibre Channel storage and a fiber switch

Combining Fencing Methods

It is possible to combine fencing methods. When a node needs to be fenced, one fence device can cut off Fibre Channel by blocking ports on a Fibre Channel switch, and an iLO card can then power-cycle the offending machine. Multiple fencing methods can also act as backups for each other. For example, the cluster nodes are fenced first by power fencing and, if that method fails, by storage fencing.

Setting up Fencing Devices

Fencing is a requirement for every operational cluster. When setting up fencing for the cluster, the first step is to set up the hardware or software device that performs the actual fencing. The Red Hat Enterprise Linux High Availability Add-On provides various fencing agents to use with different fence devices. You can install the most-used fencing agents that Red Hat supports with the fence-agents-all package. The pcs stonith list command provides a list with the name and description of all the installed fencing agents:

[root@node ~]# pcs stonith list
fence_amt_ws - Fence agent for AMT (WS)
fence_apc - Fence agent for APC over telnet/ssh
fence_apc_snmp - Fence agent for APC, Tripplite PDU over SNMP
fence_bladecenter - Fence agent for IBM BladeCenter
...output omitted...
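
If the fence-agents-all package is not yet installed, you can typically add it with dnf (an illustrative command; package availability depends on the enabled High Availability repositories):

[root@node ~]# dnf install fence-agents-all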

Depending on the fence device and fencing agent in use, the required parameters are different. Fencing of a cluster node is successful only if communication occurs between the fencing agent and the fencing device. Moreover, the fencing agent must pass the required set of parameters to the fencing device.

A man page that describes the fencing agent parameters is available on the system for every shipped fencing agent. It is also possible to list the parameters for a specific fencing agent by executing the pcs stonith describe fence_agent command.

[root@node ~]# pcs stonith describe fence_rhevm
fence_rhevm - Fence agent for RHEV-M REST API

You can use the fence_rhevm I/O fencing agent with RHEV-M REST API to fence virtual machines.

Stonith options:
ip (required): IP address or hostname of fencing device
ipport: TCP/UDP port to use for connection with device
...output omitted...

In the preceding example, the description shows that the fence_rhevm fence agent is appropriate to fence virtual machines by using the Red Hat Virtualization Manager REST API. This fence agent requires the ip parameter, with an optional ipport parameter.

Describing Examples of Fence Device Configuration

Distinct fencing devices require different hardware and software configurations. The hardware must always be set up and configured. For the software configuration, it is necessary to document various configuration parameters, which the fencing agent then uses. The following examples show different fencing devices and their hardware and software configurations.

Defining APC Network Power Switch Fencing

One way to configure power fencing is to use an APC network power switch. The hardware setup includes power cabling of the cluster nodes to the APC network power switch. Fencing with an APC network power switch requires the fencing agent to log in to the power switch to control the power outlet of a specific node. When setting up fencing with an APC fence device, document at least the following switch settings for later use with the fencing agent (fence_apc), as the example after the list shows:

  • IP address of the APC fence device

  • Username and password to access the APC fence device

  • Network protocol to access the device (SSH or Telnet)

  • The plug number, UUID, or identification for each cluster node
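
To verify the documented values, you can usually query the switch directly with the fence_apc agent from the command line. The following is an illustrative sketch; the address and credentials are assumptions for your environment:

[root@node ~]# fence_apc --ip=192.168.0.100 --username=apc --password=apc \
> --action=list
...output omitted...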

Defining Management Hardware Fencing

Management hardware, such as iLO, DRAC, or IPMI hardware, can power down, power up, and power-cycle systems. At a minimum, system administrators should configure and know the following parameters to use management cards as fencing devices:

  • IP address of the management device

  • Username and password to access the management fence device

  • The machines that the management fence device handles

Red Hat provides different fencing agents for iLO, DRAC, or IPMI hardware, for example fence_ilo, fence_drac5, or fence_ipmilan.
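
For example, a minimal sketch of creating an IPMI-based fence device for one node with fence_ipmilan (the IP address, credentials, and node name are illustrative):

[root@node ~]# pcs stonith create fence_node1_ipmi fence_ipmilan ip=192.168.100.101 \
> username=admin password=password lanplus=1 pcmk_host_list="node1.example.com"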

Defining SCSI Fencing

SCSI fencing does not require any dedicated physical hardware for fencing. The cluster administrator needs to know which device must be blocked from cluster node access with SCSI reservation. Red Hat provides the fence_scsi fence agent for this purpose.

Defining Virtual Machine Fencing

The Red Hat Enterprise Linux High Availability Add-On ships with several fencing agents for different hypervisors, for example fence_rhevm or fence_vmware_rest. They accept the same types of parameters, as the example after the list shows:

  • IP or host name of the hypervisor

  • Username and password to access the hypervisor

  • The virtual machine name for each node
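
For example, a hedged sketch of a single fence device for VMware vSphere with fence_vmware_rest (the vCenter address, credentials, and VM names are illustrative):

[root@node ~]# pcs stonith create vmfence fence_vmware_rest ip=vcenter.example.com \
> username=operator@vsphere.local password=s3cr3t ssl=1 \
> pcmk_host_map="node1.example.com:node1-vm;node2.example.com:node2-vm"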

Defining Libvirt Fencing

Cluster nodes that are virtual machines running on a Red Hat Enterprise Linux host with KVM/libvirt require the fence-virtd daemon to be configured and running on the hypervisor. Virtual machine fencing in multicast mode works by sending a fencing request, signed with a shared secret key, to the libvirt fencing multicast group. The node virtual machines can then run on different hypervisor machines, provided that fence-virtd is configured on all hypervisors with the same multicast group and the same shared secret.
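
For example, a minimal sketch that assumes fence-virtd is already configured on the hypervisor and that the libvirt domain name of the first node is node1:

[root@node ~]# pcs stonith create fence_node1_virt fence_xvm port="node1" \
> pcmk_host_list="node1.example.com"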

Defining Cloud Instance Fencing

The Red Hat Enterprise Linux High Availability Add-On supports fencing agents for the major cloud providers. For example, fence-agents-aliyun, fence-agents-aws, or fence-agents-azure-arm are the supported fencing agents for Alibaba Cloud Instances, Amazon Web Services EC2 Instances, or Microsoft Azure Virtual Machines, respectively. Although Red Hat supports these packages, they are not part of the fence-agents-all package, and you must install them individually. Because every cloud provider requires different parameters, reviewing them is beyond the scope of this chapter.
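
For example, to prepare fencing for Amazon Web Services EC2 instances, you would typically install the corresponding package first (an illustrative command; the agent and its parameters are documented in the package):

[root@node ~]# dnf install fence-agents-aws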

Note

For more information about supported fencing agents for the major cloud providers, see the Cluster Platforms and Architectures section in Knowledgebase: Support Policies for RHEL High Availability Clusters, https://access.redhat.com/articles/2912891

Testing Fence Devices

Fencing is crucial for an operational cluster. A cluster administrator must test the fencing setup thoroughly. It is possible to test the fencing device setup by calling fencing agents from the command line. All fencing agents reside in /usr/sbin/fence_*. The fencing agents typically take a -h option to show all available options. You can also use pcs stonith describe fence_agent to investigate possible options. The required options to test fencing differ from agent to agent.

For example, to show all available options for the fence_ipmilan fencing agent, you can use the following command:

[root@node ~]# fence_ipmilan -h
Usage:
        fence_ipmilan [options]
Options:
-a, --ip=[ip]
-l, --username=[name]
-p, --password=[password]
-P, --lanplus
-A, --auth=[auth]
...output omitted...

Then, for a fencing device at the 192.168.100.101 IP address, you can use the fence_ipmilan command to fence the node that is connected to that fence device:

[root@node ~]# fence_ipmilan --ip=192.168.100.101 \
> --username=admin --password=password
Success: Rebooted

Note

To test fencing with an APC power switch, it is helpful to plug in a lamp for testing instead of the actual cluster node. You can then see whether controlling the power works as expected.

Important

Calling a fencing agent directly to test fencing verifies whether the fencing device is working properly. However, it does not verify whether the fencing configuration in the cluster is correct.

Configuring Cluster Fencing Agents

A system administrator can create a fence device in the cluster with the pcs stonith create command:

[root@node ~]# pcs stonith create name fencing_agent [fencing_parameters]

The command requires additional arguments:

  • name: The name for the STONITH fence device

  • fencing_agent: The fencing agent that the fence device uses

  • fencing_parameters: Required parameters for the fencing agent

As stated in the previous section, you must select the appropriate fencing agent for your fence device. The pcs stonith list command provides a list with the name and a description of all the installed fencing agents.

Note

The fence-agents-all package provides the most used fencing agents that Red Hat supports. However, the package does not include all the supported fencing agents, and you should install some of them manually. For more information about the supported fencing agents, see Knowledgebase: Support Policies for RHEL High Availability Clusters, https://access.redhat.com/articles/2912891

All the fencing agents that are shipped with the Red Hat Enterprise Linux High Availability Add-On have the following generic properties:

pcmk_reboot_timeout

This parameter defines, in seconds, how long the cluster waits for a fencing action on this device to complete. The cluster-wide default is 60 seconds and is controlled by the stonith-timeout cluster property, which you can change with pcs property set stonith-timeout=180s. Setting pcmk_reboot_timeout overrides that default for a specific device. If a fencing action takes longer than the timeout, then the cluster considers that the fence operation failed.
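
A hedged sketch of both adjustments (the device name fence_node1 and the timeout values are illustrative):

[root@node ~]# pcs property set stonith-timeout=180s
[root@node ~]# pcs stonith update fence_node1 pcmk_reboot_timeout=120s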

pcmk_host_list

This parameter provides a space-separated list of nodes that the fencing device controls. It is required if pcmk_host_check is set to static-list.

pcmk_host_map

You usually use this parameter with fence agents that can fence multiple nodes. For example, with the fence_rhevm agent, you create only one STONITH resource to fence all the cluster nodes. You list the nodes in pcmk_host_map and associate each one with its Red Hat Virtualization VM name. The list is a semicolon-separated list of nodename:port mappings, such as node1.example.com:1;node2.example.com:2.
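
A hedged sketch of such a single fence_rhevm resource (the Manager address, credentials, and VM names are illustrative):

[root@node ~]# pcs stonith create rhvfence fence_rhevm ip=rhvm.example.com \
> username=admin@internal password=s3cr3t ssl=1 \
> pcmk_host_map="node1.example.com:node1;node2.example.com:node2"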

pcmk_host_check

This parameter defines how the cluster determines the nodes that might be controlled from the fencing device. The following values are possible:

  • dynamic-list: The cluster queries the fencing device. It works only if the fencing device can return a list of ports, and the port names match the names of the cluster nodes.

  • static-list: The cluster uses a list of nodes from pcmk_host_list, or a list of nodename:port mappings from pcmk_host_map.

  • none: The cluster assumes that every fencing device can fence every node.

The default setting for pcmk_host_check is dynamic-list, but it switches to static-list whenever the pcmk_host_map or pcmk_host_list option is used.

If you do not set any of the pcmk_host_* options, then the cluster defaults to querying the fence device for a list of ports. If a port name matches the name of a cluster node, then the cluster uses that port to fence that device. You can also create a fencing device for a single node by setting the pcmk_host_list parameter to the name of the node to fence.

Note

Not all fencing devices support listing ports for use with pcmk_host_check="dynamic-list". In those cases, you must use the pcmk_host_map option.

In addition to the previously listed generic fencing properties, fencing agent-specific properties exist. The pcs stonith describe fence_agent command shows all required and optional parameters that you can set for a particular fence device:

[root@node ~]# pcs stonith describe fence_apc
Stonith options for: fence_apc
ip (required): IP address or hostname of fencing device
username (required): Login name
password: Login password or passphrase
...output omitted...

Warning

Storage-based fence devices cut off a fenced node from storage access; they do not power-cycle the fenced node. When configuring a storage-based fence agent, such as fence_scsi, as a cluster fence device, it is important to add the meta provides=unfencing meta option. With this option, the node is automatically unfenced, and its storage access restored, before the cluster starts services on it again, so the node can rejoin the cluster.
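
For example, a minimal sketch of a fence_scsi device with unfencing enabled (the device path and node names are illustrative):

[root@node ~]# pcs stonith create fence_scsi_shared fence_scsi \
> devices=/dev/mapper/mpatha \
> pcmk_host_list="node1.example.com node2.example.com" \
> meta provides=unfencing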

To create a fencing device by using the fence_apc fencing agent, you can use the following command:

[root@node ~]# pcs stonith create fence_node1 fence_apc ip=10.168.0.101 \
> username=admin password=s3cr3t pcmk_host_list="node1.example.com"

Displaying Fencing Devices

A system administrator can run the pcs stonith status command to view the list of configured fence devices in the cluster, the fencing agent that is used, and the current status of the fence device. Fence device status can be Started or Stopped. If the status of a fence device is Started, then the device is operational; if it is Stopped, then the fence device is not operational.

[root@node ~]# pcs stonith status
* fence_node1 (stonith:fence_apc): Started node1.example.com
* fence_node2 (stonith:fence_apc): Started node2.example.com
* fence_node3 (stonith:fence_apc): Started node3.example.com

The pcs stonith config command shows the configuration options of all the STONITH resources. You can also specify the name of the STONITH resource as a parameter to show the configuration options only for that resource.

[root@node ~]# pcs stonith config fence_node1
Resource: fence_node1 (class=stonith type=fence_apc)
Attributes: ip=10.168.0.101 password=s3cr3t pcmk_host_list=node1.example.com username=admin
Operations: monitor interval=60s (fence_node1-monitor-interval-60s)

Changing Fencing Devices

You can change fencing device options with the pcs stonith update fence_device_name command. You can add a fence device option or change an existing one. For example, imagine that the fence_node2 fencing device wrongly fences node1 instead of node2. You can correct it by executing the following command:

[root@node ~]# pcs stonith update fence_node2 pcmk_host_list=node2.example.com

Removing Fencing Devices

At some point, it might be necessary to remove a fencing device from the cluster. You might want to remove the corresponding cluster node permanently, or to use a different fencing mechanism to fence the node. You can use the pcs stonith delete fence_device_name command to remove a fencing device from the cluster. For example, to remove the fence_node4 fencing device from the cluster, execute this command:

[root@node ~]# pcs stonith delete fence_node4
Attempting to stop: fence_node4...Stopped
Deleting Resource - fence_node4

Testing Fence Configuration

Use one of the following ways to verify whether a cluster fencing configuration is fully operational:

  • By using the pcs stonith fence nodename command, which attempts to fence the requested node. Run the command from another node in the cluster, not from the node to be fenced. If the command succeeds, then the cluster can fence this node, and fencing on that node works correctly, as the example after this list shows.

  • By disabling the network on a node: unplug the network cables, close the cluster ports on the firewall, or disable the entire network stack. The other nodes in the cluster should detect that the node failed, and fence it. This procedure also tests the cluster's ability to detect a failed node.
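
For example, to fence node2 from node1 (the node names are illustrative, and the exact output can vary):

[root@node1 ~]# pcs stonith fence node2.example.com
Node: node2.example.com fenced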

This concludes the section on testing system integrity with fencing.

For more information, refer to the High Availability Add-On Overview chapter in the Configuring and Managing High Availability Clusters guide at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_high_availability_clusters/index#assembly_overview-of-high-availability-configuring-and-managing-high-availability-clusters

Knowledgebase article: Support Policies for RHEL High Availability Clusters - General Requirements for Fencing/STONITH at https://access.redhat.com/articles/2881341

fence_*(8) man pages

For more information, refer to Chapter 9. Configuring Fencing in a Red Hat High Availability Cluster in the Configuring and Managing High Availability Clusters guide at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/configuring_and_managing_high_availability_clusters/index#assembly_configuring-fencing-configuring-and-managing-high-availability-clusters

Knowledgebase article: How to Configure Fence Agent 'fence_xvm' in RHEL Cluster at https://access.redhat.com/solutions/917833

pcs(8) and fence_*(8) man pages

Knowledgebase: What Format Should I Use to Specify Node Mappings to STONITH Devices in pcmk_host_list and pcmk_host_map in a RHEL 6, 7, or 8 High Availability Cluster? at https://access.redhat.com/solutions/2619961
