After completing this section, you should be able to explain a failover in a multitarget system replication environment and describe the general behavior.
SAP HANA Multitarget System Replication consists of three or more system replication sites, in either the same or different data centers, which are kept in sync through HANA System Replication (HSR).
It is described in more detail here: SAP HANA Multitarget System Replication, https://help.sap.com/docs/SAP_HANA_PLATFORM/4e9b18c116aa42fc84c7dbfd02111aba/ba457510958241889a459e606bbcf3d3.html.
This lecture describes how multitarget system replication works and which options are recommended.
SAP HANA System Replication supports different log replication modes (the replicationMode parameter), summarized in the following table and illustrated by the registration sketch after it; see https://help.sap.com/docs/SAP_HANA_PLATFORM/6b94445c94ae495c83a19646e7c3fd56/c3fe0a3c263c49dc9404143306455e16.html?q=syncmem.
| Mode | Description |
|---|---|
| sync | When the primary receives the acknowledgement, the log buffer has been persisted by all the tiers. |
| syncmem | When the primary receives the acknowledgement, the log buffer has been received in memory by the secondary, but it is not guaranteed that it has been persisted by all the tiers. This has a potentially lower impact on the primary compared to sync. |
| async | Used for longer distances. The primary does not wait for an acknowledgement, so the ASYNC replication buffer (an intermediate memory buffer) might run full. This mode has the lowest impact on the primary database. |
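For illustration, a minimal registration sketch for a far-away third site; the placeholder <dc1host>, the instance number 00, and the site name DC3 are assumptions, not values taken from this setup:

# run as <sid>adm on the DC3 master node; async suits the longer distance
sidadm% hdbnsutil -sr_register --remoteHost=<dc1host> --remoteInstance=00 --replicationMode=async --operationMode=logreplay --name=DC3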
Multitarget system replication (MTR) supports the following operationMode (log operation mode) values; a registration sketch follows the list:
logreplay
logreplay_readaccess
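The operation mode is chosen when a secondary is registered. As a sketch, assuming instance number 00 and site name DC2, logreplay_readaccess would additionally open the secondary for read access:

sidadm% hdbnsutil -sr_register --remoteHost=<dc1host> --remoteInstance=00 --replicationMode=syncmem --operationMode=logreplay_readaccess --name=DC2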
The following section covers the use cases, describing in more detail what happens in case of a failure and how the environment behaves in combination with a cluster.
The following use cases are covered:
Normal operation
Failover to secondary
Cleaning up the configuration
Failback
Manual interaction in cases of failure
Differences when using Pacemaker
The setup is done as described in the previous lecture.
The primary database runs in the first data center. The changes are replicated to running SAP HANA databases in the second and third data centers, using the selected replication mode.
| DC1 | DC2 | DC3 |
|---|---|---|
| DC1 is running primary | DC2 is running secondary | DC3 is running secondary |
Depending on the workload and the distance, the relevant replication mode is used.
In a typical configuration, Pacemaker automatically controls DC1 and DC2.
DC3 is outside the control of Pacemaker and uses the register_secondaries_on_takeover option set in the global.ini file.
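The option belongs to the [system_replication] section; a minimal global.ini sketch:

[system_replication]
register_secondaries_on_takeover = true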
| DC1 | DC2 | DC3 |
|---|---|---|
| DC1 failed | DC2 becomes the new primary | DC3 loses its primary |
| DC1 is waiting for a primary | DC2 is running primary | DC3 is re-registered to DC2 |
DC1 must be re-registered, either manually, by Pacemaker with the AUTOMATED_REGISTER=true option, or through the register_secondaries_on_takeover option in global.ini.
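A sketch of the manual re-registration of DC1 against the new primary in DC2; the placeholder <dc2host>, instance number 00, and the replication mode are assumptions:

# run as <sid>adm on the former primary node in DC1
sidadm% hdbnsutil -sr_register --remoteHost=<dc2host> --remoteInstance=00 --replicationMode=syncmem --operationMode=logreplay --name=DC1 --online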
| DC2 | DC3 |
|---|---|
| DC2 is running primary | DC3 is re-registered to DC2 |
| DC1 | DC2 | DC3 |
|---|---|---|
| DC1 is running secondary | DC2 is running primary | DC3 is running secondary |
First, you must verify whether cleanup is necessary. Verify the status of SAP HANA system replication, and the status of the cluster if Pacemaker is used.
| Topic | Procedure to check | Expected results |
|---|---|---|
| HSR | hdbnsutil -sr_state | Find the primary and check the HSR status on the primary node |
| PCS | pcs status --full | Verify the status of the cluster and clean it up with the pcs resource cleanup <resource-name> command when necessary |
Potential remediation steps are as follows:
| Topic | Procedure |
|---|---|
| Node is not registered | Use hdbnsutil to register the node. |
| Resource failed or is not enabled | Use the pcs resource clear or the pcs resource enable command to fix it. |
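For example, assuming the resource name SAPHana_RH1_00-clone used later in this lecture, a failed resource could be repaired as follows:

[root]# pcs resource cleanup SAPHana_RH1_00-clone
[root]# pcs resource enable SAPHana_RH1_00-clone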
| DC1 | DC2 | DC3 |
|---|---|---|
| DC1 is running secondary | DC2 is running primary | DC3 is running secondary |
| S2 | P2 | S2 |
| hdbnsutil -sr_takeover (run on DC1) | Database is stopped | Database is re-registered to S1 |
| DC1 is running primary | Database becomes secondary | DC3 is running secondary |
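As a sketch of this failback sequence, assuming AUTOMATED_REGISTER=true is not active and DC2 must therefore be re-registered manually (host names and instance number 00 are assumptions):

# run as <sid>adm on a DC1 node to take over the primary role
sidadm% hdbnsutil -sr_takeover
# afterwards, run as <sid>adm on a DC2 node to register it as secondary
sidadm% hdbnsutil -sr_register --remoteHost=<dc1host> --remoteInstance=00 --replicationMode=syncmem --operationMode=logreplay --name=DC2 --online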
If the database on DC2 is not started or is not registered as a secondary, this must be done manually. Review the next section.
| Task | Command | Dependencies |
|---|---|---|
| Start database | sidadm% sapcontrol -nr 00 -function StartSystem | Mount points, IP addresses |
| Enable resource | pcs resource enable <resourcename> | Pacemaker is used |
| Clear resource | pcs resource clear <resourcename> | Pacemaker is used and errors are listed |
| Clean up resource | pcs resource cleanup <resourcename> | Pacemaker is used and errors are listed |
| Register secondary | sidadm% hdbnsutil -sr_register --remoteHost=<primaryhost> --remoteInstance=00 --replicationMode=syncmem --name=<localsitename> --online | HANA environment is installed and primary is up and running |
| Enable replication | sidadm% hdbnsutil -sr_enable | hdbnsutil -sr_state on the primary shows that replication is not enabled |
| Take over primary | sidadm% hdbnsutil -sr_takeover | Run on the secondary to take over the primary database role |
| Check system replication status | sidadm% hdbnsutil -sr_state | Run on the primary node with an active database |
| Check system replication status offline | sidadm% hdbnsutil -sr_stateConfiguration | Displays the system replication relationship that is stored in the global.ini file |
If you are using Pacemaker, then DC1 and DC2 are controlled by the Pacemaker cluster and the resource agent.
The nodes of DC3 are part of the cluster. However, constraints must be set to ensure that none of the SAPHana resources can run on any of the nodes that belong to DC3, as sketched below.
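A hedged sketch of such location constraints; the node names dc3hana01 and dc3hana02 are assumptions following the naming used elsewhere in this section:

# keep the SAPHana and SAPHanaTopology clones off the DC3 nodes
[root]# pcs constraint location SAPHana_RH1_00-clone avoids dc3hana01 dc3hana02
[root]# pcs constraint location SAPHanaTopology_RH1_00-clone avoids dc3hana01 dc3hana02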
List all SAP HANA database instances:
[rh1adm]# /usr/sap/hostctrl/exe/sapcontrol -nr 00 -function GetSystemInstanceList
10.04.2019 08:38:21
GetSystemInstanceList
OK
hostname, instanceNr, httpPort, httpsPort, startPriority, features, dispstatus
dc1hana01, 00, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
dc1hana03, 00, 50013, 50014, 0.3, HDB|HDB_STANDBY, GREEN
dc1hana02, 00, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
dc1hana04, 00, 50013, 50014, 0.3, HDB|HDB_WORKER, GREEN
Show the landscapeHostConfiguration definitions:
rh1adm@dc1hana01:/usr/sap/RH1/HDB00> HDBSettings.sh landscapeHostConfiguration.py
| Host      | Host   | Host   | Failover | Remove | Storage   | Storage   | Failover | Failover | NameServer | NameServer | IndexServer | IndexServer | Host    | Host    | Worker  | Worker  |
|           | Active | Status | Status   | Status | Config    | Actual    | Config   | Actual   | Config     | Actual     | Config      | Actual      | Config  | Actual  | Config  | Actual  |
|           |        |        |          |        | Partition | Partition | Group    | Group    | Role       | Role       | Role        | Role        | Roles   | Roles   | Groups  | Groups  |
| --------- | ------ | ------ | -------- | ------ | --------- | --------- | -------- | -------- | ---------- | ---------- | ----------- | ----------- | ------- | ------- | ------- | ------- |
| dc1hana01 | yes    | ok     |          |        | 1         | 1         | default  | default  | master 1   | master     | worker      | master      | worker  | worker  | default | default |
| dc1hana02 | yes    | ok     |          |        | 2         | 2         | default  | default  | master 3   | slave      | worker      | slave       | worker  | worker  | default | default |
....
Review the HANA system replication:
[rh1adm]# python /usr/sap/RH1/HDB00/exe/python_support/systemReplicationStatus.py
| Host  | Port  | Service Name | Volume ID | Site ID | Site Name | Secondary | Secondary | Secondary | Secondary | Secondary     | Replication | Replication |
|       |       |              |           |         |           | Host      | Port      | Site ID   | Site Name | Active Status | Mode        | Status      |
| ----- | ----- | ------------ | --------- | ------- | --------- | --------- | --------- | --------- | --------- | ------------- | ----------- | ----------- |
| node1 | 30201 | nameserver   | 1         | 1       | DC1       | node2     | 30201     | 2         | DC2       | YES           | SYNCMEM     | ACTIVE      |
| node1 | 30207 | xsengine     | 2         | 1       | DC1       | node2     | 30207     | 2         | DC2       | YES           | SYNCMEM     | ACTIVE      |
| node1 | 30203 | indexserver  | 3         | 1       | DC1       | node2     | 30203     | 2         | DC2       | YES           | SYNCMEM     | ACTIVE      |
status system replication site "2": ACTIVE
overall system replication status: ACTIVE
Local System Replication State
~~~~~~~~~~
mode: PRIMARY
site id: 1
site name: DC1
Test SrConnectionChangedHook:
# To check if hook scripts are working
[rh1adm]# cdtrace
[rh1adm]# awk '/ha_dr_SAPHanaSR.*crm_attribute/ \
> { printf "%s %s %s %s\n",$2,$3,$5,$16 }' nameserver_*
2018-05-04 12:34:04.476445 ha_dr_SAPHanaSR SFAIL
2018-05-04 12:53:06.316973 ha_dr_SAPHanaSR SOK
[rh1adm]# grep ha_dr_ *
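For reference, a sketch of the global.ini entries that enable this hook, assuming the SAPHanaSR-scaleOut package provides it at the path shown; verify both against your installation:

[ha_dr_provider_SAPHanaSR]
provider = SAPHanaSR
path = /usr/share/SAPHanaSR-scaleOut/
execution_order = 1

[trace]
ha_dr_saphanasr = info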
Review the SAPHanaTopology resource:
[root]# pcs resource show SAPHanaTopology_RH1_00-clone
Clone: SAPHanaTopology_RH1_00-clone
Meta Attrs: clone-max=2 clone-node-max=1 interleave=true
Resource: SAPHanaTopology_RH1_00 (class=ocf provider=heartbeat type=SAPHanaTopology)
Attributes: SID=RH1 InstanceNumber=00
Operations: start interval=0s timeout=600 (SAPHanaTopology_RH1_00-start-interval-0s)
stop interval=0s timeout=300 (SAPHanaTopology_RH1_00-stop-interval-0s)
monitor interval=10 timeout=600 (SAPHanaTopology_RH1_00-monitor-interval-10s)
Review the SAPHana resource:
[root]# pcs resource config SAPHana_RH1_00
Clone: SAPHana_RH1_00-clone
Meta Attrs: clone-max=2 clone-node-max=1 interleave=true notify=true promotable=true
Resource: SAPHana_RH1_00 (class=ocf provider=heartbeat type=SAPHana)
Attributes: AUTOMATED_REGISTER=true DUPLICATE_PRIMARY_TIMEOUT=180 InstanceNumber=00 PREFER_SITE_TAKEOVER=true SID=RH1
Operations: demote interval=0s timeout=3600 (SAPHana_RH1_00-demote-interval-0s)
methods interval=0s timeout=5 (SAPHana_RH1_00-methods-interval-0s)
monitor interval=61 role=Slave timeout=700 (SAPHana_RH1_00-monitor-interval-61)
monitor interval=59 role=Master timeout=700 (SAPHana_RH1_00-monitor-interval-59)
promote interval=0s timeout=3600 (SAPHana_RH1_00-promote-interval-0s)
start interval=0s timeout=3600 (SAPHana_RH1_00-start-interval-0s)
stop interval=0s timeout=3600 (SAPHana_RH1_00-stop-interval-0s)
Check the cluster:
[root]# pcs status --full
Cluster name: hanascaleoutsr
Stack: corosync
Current DC: majoritymaker (9) (version 1.1.18-11.el7_5.4-2b07d5c5a9) - partition with quorum
Last updated: Tue Mar 26 16:34:22 2019
Last change: Tue Mar 26 16:34:03 2019 by root via crm_attribute on dc2hana01
9 nodes configured
20 resources configured
Online: [ dc1hana01 (1) dc1hana02 (2) dc1hana03 (3) dc1hana04 (4) dc2hana01 (5) dc2hana02 (6) dc2hana03 (7) dc2hana04 (8) majoritymaker (9) ]
......
--------------------------------------------------------
1 PROMOTED master1:master:worker:master 150 DC1
2 DEMOTED master2:slave:worker:slave 110 DC1
3 DEMOTED slave:slave:worker:slave -10000 DC1
4 DEMOTED master3:slave:standby:standby 115 DC1
5 DEMOTED master2:master:worker:master 100 DC2
6 DEMOTED master3:slave:worker:slave 80 DC2
7 DEMOTED slave:slave:worker:slave -12200 DC2
8 DEMOTED master1:slave:standby:standby 80 DC2
9 :shtdown:shtdown:shtdown
Check VIP1:
[root]# pcs resource show vip_RH1_00
Resource: vip_RH1_00 (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=192.168.0.15
Operations: start interval=0s timeout=20s (vip_RH1_00-start-interval-0s)
stop interval=0s timeout=20s (vip_RH1_00-stop-interval-0s)
monitor interval=10s timeout=20s (vip_RH1_00-monitor-interval-10s)
This concludes the section for SAP HANA scale-out multitarget system replication failover.