
Explaining Failover in the Multitarget System Replication Environment

Objectives

After completing this section, you should be able to explain how failover works in a multitarget system replication environment and describe its general behavior.

Introduction

SAP HANA Multitarget System Replication consists of three or more system replication sites, in either the same or different data centers, which are kept in sync through HANA System Replication (HSR).

It is described in more detail here: SAP HANA Multitarget System Replication, https://help.sap.com/docs/SAP_HANA_PLATFORM/4e9b18c116aa42fc84c7dbfd02111aba/ba457510958241889a459e606bbcf3d3.html.

This lecture describes how multitarget system replication works and which options are recommended.

SAP HANA System Replication supports several log replication modes (replicationMode); see https://help.sap.com/docs/SAP_HANA_PLATFORM/6b94445c94ae495c83a19646e7c3fd56/c3fe0a3c263c49dc9404143306455e16.html?q=syncmem. A quick way to verify the active mode is sketched after the table.

Mode | Description
sync | After the primary receives the acknowledgement, the log buffer has been persisted by all tiers.
syncmem | After the primary receives the acknowledgement, it is not guaranteed that the log buffer has been persisted by all tiers. The impact on the primary is potentially lower than with sync.
async | Used for longer distances. The primary does not wait for an acknowledgement, so the impact on the primary database is lowest, but the ASYNC replication buffer (an intermediate memory buffer) can run full.
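
To see which mode is in effect for a running replication, you can run hdbnsutil -sr_state on a secondary site. The following is a minimal, abbreviated sketch; the site values are placeholders and the exact output depends on the SAP HANA version:

[rh1adm]# hdbnsutil -sr_state
....
mode: syncmem
site id: 2
site name: DC2
....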

Multitarget system replication (MTR) supports the following operationMode (log operation mode) settings; a registration sketch follows the list:

  • logreplay

  • logreplay_readaccess
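
Both the replication mode and the operation mode are selected when a secondary is registered against its replication source. As a minimal sketch (host, instance number, and site names are examples only, not taken from a specific landscape), registering a DC3 secondary against the DC1 primary with asynchronous replication and logreplay could look like this:

rh1adm% hdbnsutil -sr_register --remoteHost=dc1hana01 --remoteInstance=00 --replicationMode=async --operationMode=logreplay --name=DC3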

The following sections cover the use cases, describe in more detail what happens in case of a failure, and explain the behavior in combination with a cluster.

Use Cases

The following use cases are covered:

  • Normal operation

  • Failover to secondary

  • Cleaning up the configuration

  • Failback

  • Manual interaction in cases of failure

  • Differences when using Pacemaker

Normal Operation

The setup is done as described in the previous lecture.

The primary database is running in the first data center. Its changes are replicated to running SAP HANA databases in the second and third data centers, using the configured replication mode.

DC1 is running primary | DC2 is running secondary | DC3 is running secondary

Depending on the workload and the distance, the relevant replication mode is used.
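
During normal operation, the primary is aware of both replication targets. On SAP HANA 2.0, hdbnsutil -sr_state on the primary typically prints a site mapping tree similar to the following sketch (abbreviated and illustrative; the site names and modes are examples):

[rh1adm]# hdbnsutil -sr_state
....
Site Mappings:
~~~~~~~~~~~~~~
DC1 (primary/primary)
|---DC2 (syncmem/logreplay)
|---DC3 (async/logreplay)
....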

Failover to Secondary

In a typical configuration, Pacemaker automatically controls DC1 and DC2. DC3 is outside the control of Pacemaker, and uses the register_secondaries_on_takeover option that is set in the global.ini file.
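
A minimal sketch of the corresponding global.ini entry, which is typically set on all sites so that DC3 follows the new primary after a takeover (illustrative excerpt only):

[system_replication]
register_secondaries_on_takeover = true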

DC1 failed | DC2 becomes the new primary | DC3 loses its primary
DC1 is waiting for primary | DC2 is running primary | DC3 is re-registered to DC2
DC1 must be re-registered, either manually, by Pacemaker with the AUTOMATED_REGISTER=true option, or through the register_secondaries_on_takeover option in global.ini (see the sketch after this table) | DC2 is running primary | DC3 is re-registered to DC2
DC1 is running secondary | DC2 is running primary | DC3 is running secondary
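
If neither AUTOMATED_REGISTER nor an automatic re-registration applies, the former primary DC1 can be re-registered manually once its database has been stopped. A minimal sketch, with placeholder host and site names:

rh1adm% hdbnsutil -sr_register --remoteHost=dc2hana01 --remoteInstance=00 --replicationMode=syncmem --operationMode=logreplay --name=DC1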

Cleanup of the Configuration

First, you must verify whether cleanup is necessary. Verify the status of SAP HANA system replication, and the status of the cluster if Pacemaker is used.

Topic | Procedure to check | Expected results
HSR | hdbnsutil -sr_state | Find the primary and check the HSR status on the primary node.
PCS | pcs status --full | Verify the status of the cluster and clean it up with the pcs resource cleanup <resource-name> command when necessary.

Potential remediation steps are as follows:

Topic | Procedure
Node is not registered | Use hdbnsutil to register the node (see the Manual Interaction table below).
Resource failed or is not enabled | Use the pcs resource clear or the pcs resource enable command to fix it (a sketch follows this table).
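
For the Pacemaker cases, the commands might look like the following minimal sketch; the resource name SAPHana_RH1_00-clone is an example that matches the resources shown later in this lecture:

[root]# pcs resource clear SAPHana_RH1_00-clone
[root]# pcs resource enable SAPHana_RH1_00-clone

The registration command for an unregistered node is listed in the Manual Interaction table below.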

Failback to Primary without Pacemaker

DC1 | DC2 | DC3
DC1 is running secondary | DC2 is running primary | DC3 is running secondary
hdbnsutil -sr_takeover | Database is stopped | Database is re-registered to S1
DC1 is running primary | Database becomes secondary | DC3 is running secondary
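
A manual failback might therefore look like the following sketch, assuming placeholder host and site names and that register_secondaries_on_takeover is set:

# On a DC1 node (currently secondary), take over the primary role
rh1adm% hdbnsutil -sr_takeover
# On DC2, after its database has been stopped, re-register it as a secondary of DC1
rh1adm% hdbnsutil -sr_register --remoteHost=dc1hana01 --remoteInstance=00 --replicationMode=syncmem --name=DC2
# DC3 is re-registered to DC1 automatically when register_secondaries_on_takeover is set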

If the database on DC2 is not started or is not registered as a secondary, you must do this manually. Review the next section.

Manual Interaction

Task | Command | Dependencies
Start database | sidadm% sapcontrol -nr 00 -function StartSystem | Mount points, IP addresses
Enable resource | pcs resource enable <resourcename> | Pacemaker is used
Clear resource | pcs resource clear <resourcename> | Pacemaker is used and errors are listed
Clean up resource | pcs resource cleanup <resourcename> | Pacemaker is used and errors are listed
Register secondary | sidadm% hdbnsutil -sr_register --remoteHost=<primarynode> --remoteInstance=00 --replicationMode=syncmem --name=<localsitename> --online | HANA environment is installed and the primary is up and running
Enable replication | sidadm% hdbnsutil -sr_enable | hdbnsutil -sr_state on the primary shows that replication is not enabled
Take over primary | sidadm% hdbnsutil -sr_takeover | Run on the secondary to take over the role of the primary database node
Check system replication status | sidadm% hdbnsutil -sr_state | Run on the primary node with an active database
Check system replication status offline | sidadm% hdbnsutil -sr_stateConfiguration | Displays the system replication relationship that is stored in the global.ini file

Differences in Combination with Pacemaker

If you are using Pacemaker, then DC1 and DC2 are controlled by the Pacemaker cluster and the resource agent.

The nodes of DC3 are part of the cluster. However, constraints must be set to ensure that none of the SAPHana resources can run on any of the nodes that belong to DC3, as sketched below.
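
A minimal sketch of such location constraints, assuming hypothetical DC3 node names dc3hana01 and dc3hana02 and the resource names used later in this lecture:

[root]# pcs constraint location SAPHanaTopology_RH1_00-clone avoids dc3hana01 dc3hana02
[root]# pcs constraint location SAPHana_RH1_00-clone avoids dc3hana01 dc3hana02
[root]# pcs constraint location vip_RH1_00 avoids dc3hana01 dc3hana02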

Testing the Configuration

  • List all SAP HANA database instances:

[rh1adm]# /usr/sap/hostctrl/exe/sapcontrol -nr 00 -function GetSystemInstanceList
10.04.2019 08:38:21
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
dc1hana01, 00, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
dc1hana03, 00, 50013, 50014, 0.3, HDB|HDB_STANDBY, GREEN
dc1hana02, 00, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
dc1hana04, 00, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
  • Show the landscapeHostConfiguration definitions:

rh1adm@dc1hana01:/usr/sap/RH1/HDB00> HDBSettings.sh landscapeHostConfiguration.py
| Host      | Host   | Host   | Failover | Remove | Storage   | Storage   | Failover | Failover | NameServer | NameServer | IndexServer | IndexServer | Host    | Host    | Worker  | Worker  |
|           | Active | Status | Status   | Status | Config    | Actual    | Config   | Actual   | Config     | Actual     | Config      | Actual      | Config  | Actual  | Config  | Actual  |
|           |        |        |          |        | Partition | Partition | Group    | Group    | Role       | Role       | Role        | Role        | Roles   | Roles   | Groups  | Groups  |
| --------- | ------ | ------ | -------- | ------ | --------- | --------- | -------- | -------- | ---------- | ---------- | ----------- | ----------- | ------- | ------- | ------- | ------- |
| dc1hana01 | yes    | ok     |          |        |         1 |         1 | default  | default  | master 1   | master     | worker      | master      | worker  | worker  | default | default |
| dc1hana02 | yes    | ok     |          |        |         2 |         2 | default  | default  | master 3   | slave      | worker      | slave       | worker  | worker  | default | default |
....
ok
  • Review the HANA system replication:

[rh1adm]# python /usr/sap/RH1/HDB02/exe/python_support/systemReplicationStatus.py


| Host  | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary     | Replication | Replication |
|       |       |              |           |         |           | Host      | Port      | Site ID   | Site Name | Active Status | Mode        | Status      |
| ----- | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- |
| node1 | 30201 | nameserver   |         1 |       1 | DC1       | node2     |     30201 |         2 | DC2       | YES           | SYNCMEM     | ACTIVE      |
| node1 | 30207 | xsengine     |         2 |       1 | DC1       | node2     |     30207 |         2 | DC2       | YES           | SYNCMEM     | ACTIVE      |
| node1 | 30203 | indexserver  |         3 |       1 | DC1       | node2     |     30203 |         2 | DC2       | YES           | SYNCMEM     | ACTIVE      |

status system replication site "2": ACTIVE
overall system replication status: ACTIVE

Local System Replication State
~~~~~~~~~~

mode: PRIMARY
site id: 1
site name: DC1
  • Test SrConnectionChangedHook:

# To check if hook scripts are working
[rh1adm]# cdtrace
[rh1adm]# awk '/ha_dr_SAPHanaSR.*crm_attribute/ \
> { printf "%s %s %s %s\n",$2,$3,$5,$16 }' nameserver_*
2018-05-04 12:34:04.476445 ha_dr_SAPHanaSR SFAIL
2018-05-04 12:53:06.316973 ha_dr_SAPHanaSR SOK
[rh1adm]# grep ha_dr_ *
  • Review the SAPHanaTopology resource:

[root]# pcs resource show SAPHanaTopology_RH1_00-clone

 Clone: SAPHanaTopology_RH1_00-clone
  Meta Attrs: clone-max=2 clone-node-max=1 interleave=true
  Resource: SAPHanaTopology_RH1_00 (class=ocf provider=heartbeat type=SAPHanaTopology)
   Attributes: SID=RH1 InstanceNumber=02
   Operations: start interval=0s timeout=600 (SAPHanaTopology_RH1_00-start-interval-0s)
               stop interval=0s timeout=300 (SAPHanaTopology_RH1_00-stop-interval-0s)
               monitor interval=10 timeout=600 (SAPHanaTopology_RH1_00-monitor-interval-10s)
  • Review the SAPHana resource:

[root]# pcs resource config SAPHana_RH1_00
 Clone: SAPHana_RH1_00-clone
  Meta Attrs: clone-max=2 clone-node-max=1 interleave=true notify=true promotable=true
  Resource: SAPHana_RH1_00 (class=ocf provider=heartbeat type=SAPHana)
   Attributes: AUTOMATED_REGISTER=true DUPLICATE_PRIMARY_TIMEOUT=180 InstanceNumber=02 PREFER_SITE_TAKEOVER=true SID=RH1
   Operations: demote interval=0s timeout=3600 (SAPHana_RH1_00-demote-interval-0s)
               methods interval=0s timeout=5 (SAPHana_RH1_00-methods-interval-0s)
               monitor interval=61 role=Slave timeout=700 (SAPHana_RH1_00-monitor-interval-61)
               monitor interval=59 role=Master timeout=700 (SAPHana_RH1_00-monitor-interval-59)
               promote interval=0s timeout=3600 (SAPHana_RH1_00-promote-interval-0s)
               start interval=0s timeout=3600 (SAPHana_RH1_00-start-interval-0s)
               stop interval=0s timeout=3600 (SAPHana_RH1_00-stop-interval-0s)
  • Check the cluster:

[root]# pcs status --full
Cluster name: hanascaleoutsr
Stack: corosync
Current DC: majoritymaker (9) (version 1.1.18-11.el7_5.4-2b07d5c5a9)
- partition with quorum
Last updated: Tue Mar 26 16:34:22 2019
Last change: Tue Mar 26 16:34:03 2019 by root via crm_attribute on
dc2hana01
9 nodes configured
20 resources configured
Online: [ dc1hana01 (1) dc1hana02 (2) dc1hana03 (3) dc1hana04 (4)
dc2hana01 (5) dc2hana02 (6) dc2hana03 (7) dc2hana04 (8) majoritymaker
(9) ]
......
--------------------------------------------------------
1 PROMOTED master1:master:worker:master 150 DC1
2 DEMOTED master2:slave:worker:slave 110 DC1
3 DEMOTED slave:slave:worker:slave -10000 DC1
4 DEMOTED master3:slave:standby:standby 115 DC1
5 DEMOTED master2:master:worker:master 100 DC2
6 DEMOTED master3:slave:worker:slave 80 DC2
7 DEMOTED slave:slave:worker:slave -12200 DC2
8 DEMOTED master1:slave:standby:standby 80 DC2
9 :shtdown:shtdown:shtdown
  • Check VIP1:

[root]# pcs resource show vip_RH1_00

 Resource: vip_RH1_00 (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=192.168.0.15
  Operations: start interval=0s timeout=20s (vip_RH1_00-start-interval-0s)
              stop interval=0s timeout=20s (vip_RH1_00-stop-interval-0s)
              monitor interval=10s timeout=20s (vip_RH1_00-monitor-interval-10s)

This concludes the section for SAP HANA scale-out multitarget system replication failover.

Revision: rh445-8.4-4e0c572