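1. With the simple repmgr cluster from the previous section in place, check the cluster status and the repmgr service status. The output shows that repmgrd is not running on either node.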
[postgres@localhost bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=192.168.101.9 port=5432 user=postgres dbname=postgres
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=192.168.101.7 port=5432 user=postgres dbname=postgres

[postgres@localhost bin]$ ./repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd     | PID | Paused? | Upstream last seen
----+-------+---------+-----------+----------+-------------+-----+---------+--------------------
 1  | node1 | primary | * running |          | not running | n/a | n/a     | n/a
 2  | node2 | standby |   running | node1    | not running | n/a | n/a     | n/a
2. Edit the parameters in repmgr.conf
vim /etc/repmgr/12/repmgr.conf

failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote'
follow_command='/usr/pgsql-12/bin/repmgr standby follow'
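The failover parameter accepts two values:
automatic: enables automatic failover
manual: disables automatic failover
With automatic failover disabled, the standby writes the following log after it detects that the primary has failed. As the log shows, the standby is not promoted to primary automatically: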
[2020-04-24 22:49:17] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 22:49:17] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-04-24 22:49:17] [NOTICE] this node is not configured for automatic failover so will not be considered as promotion candidate, and will not follow the new primary
[2020-04-24 22:49:17] [DETAIL] "failover" is set to "manual" in repmgr.conf
[2020-04-24 22:49:17] [HINT] manually execute "repmgr standby follow" to have this node follow the new primary
[2020-04-24 22:49:17] [INFO] follower node awaiting notification from a candidate node
[2020-04-24 22:50:17] [WARNING] no notification received from new primary after 60 seconds
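
Since a node left at failover='manual' simply stays a standby, it can be useful to check repmgr.conf before starting repmgrd. The following is a minimal sketch of such a check, not part of repmgr itself. The function names are hypothetical, the parser only handles plain key=value lines, and it assumes that an unset failover behaves like 'manual'.

```python
# Hypothetical pre-flight check for the three settings edited in step 2.
# Simplification: only plain key=value lines are handled; trailing inline
# comments and include directives are not.

SAMPLE_CONF = """\
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote'
follow_command='/usr/pgsql-12/bin/repmgr standby follow'
"""

REQUIRED_FOR_AUTO_FAILOVER = ("promote_command", "follow_command")


def parse_repmgr_conf(text):
    """Parse key=value lines; blank lines and '#' comment lines are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding single or double quotes from the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def failover_problems(settings):
    """Return a list of reasons why automatic failover would not happen."""
    problems = []
    # Assumption: a missing failover setting is treated like 'manual'.
    if settings.get("failover", "manual") != "automatic":
        problems.append("failover is not 'automatic'")
    for key in REQUIRED_FOR_AUTO_FAILOVER:
        if not settings.get(key):
            problems.append(f"{key} is not set")
    return problems


if __name__ == "__main__":
    print(failover_problems(parse_repmgr_conf(SAMPLE_CONF)))
```

To check a real node, read the file contents (for example /etc/repmgr/12/repmgr.conf) and pass them to parse_repmgr_conf instead of SAMPLE_CONF. An empty list means all three settings from step 2 are present.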
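3. Now start the repmgrd process on the cluster nodes
Run the following from the bin directory on both the primary and the standby: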
./repmgrd -d
4. After starting repmgrd, check the service status
[postgres@localhost bin]$ ./repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | * running |          | running | 11558 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 10818 | no      | 0 second(s) ago
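
The repmgrd column is the one that matters here: automatic failover can only happen on nodes where it reads "running". As a rough sketch, the table can be checked programmatically. The helper names are hypothetical, and the parser assumes the pipe-separated layout shown above rather than any machine-readable output mode.

```python
# Sample copied from the `repmgr service status` output in step 4.
SAMPLE = """\
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+-------+---------+--------------------
 1  | node1 | primary | * running |          | running | 11558 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 10818 | no      | 0 second(s) ago
"""


def parse_status_table(text):
    """Turn the pipe-separated table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= set("-+"):
            continue  # skip the ----+---- separator line
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def nodes_without_repmgrd(text):
    """Return the names of nodes whose repmgrd column is not 'running'."""
    return [row["Name"] for row in parse_status_table(text)
            if row["repmgrd"] != "running"]


if __name__ == "__main__":
    print(nodes_without_repmgrd(SAMPLE))
```

Feeding it the output from step 1, where repmgrd was not yet started, would return both node names.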
5. Now simulate a failure of the primary. The standby's log is as follows:
[2020-04-24 23:14:02] [INFO] monitoring connection to upstream node "node1" (ID: 1)
[2020-04-24 23:14:38] [WARNING] unable to ping "host=192.168.101.9 port=5432 user=postgres dbname=postgres"
[2020-04-24 23:14:38] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:14:38] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2020-04-24 23:14:38] [INFO] checking state of node 1, 1 of 6 attempts
[2020-04-24 23:14:38] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.101.9 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2020-04-24 23:14:38] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:14:38] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-04-24 23:14:48] [INFO] checking state of node 1, 2 of 6 attempts
[2020-04-24 23:14:48] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.101.9 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2020-04-24 23:14:48] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:14:48] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-04-24 23:14:58] [INFO] checking state of node 1, 3 of 6 attempts
[2020-04-24 23:14:58] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.101.9 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2020-04-24 23:14:58] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:14:58] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-04-24 23:15:08] [INFO] checking state of node 1, 4 of 6 attempts
[2020-04-24 23:15:08] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.101.9 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2020-04-24 23:15:08] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:15:08] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-04-24 23:15:18] [INFO] checking state of node 1, 5 of 6 attempts
[2020-04-24 23:15:18] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.101.9 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2020-04-24 23:15:18] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:15:18] [INFO] sleeping 10 seconds until next reconnection attempt
[2020-04-24 23:15:28] [INFO] checking state of node 1, 6 of 6 attempts
[2020-04-24 23:15:28] [WARNING] unable to ping "user=postgres dbname=postgres host=192.168.101.9 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2020-04-24 23:15:28] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-04-24 23:15:28] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-04-24 23:15:28] [INFO] 0 active sibling nodes registered
[2020-04-24 23:15:28] [INFO] primary node "node1" (ID: 1) and this node have the same location ("default")
[2020-04-24 23:15:28] [INFO] no other sibling nodes - we win by default
[2020-04-24 23:15:28] [NOTICE] this node is the only available candidate and will now promote itself
[2020-04-24 23:15:28] [INFO] promote_command is: "/usr/pgsql-12/bin/repmgr standby promote"
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
[2020-04-24 23:15:29] [INFO] 0 followers to notify
[2020-04-24 23:15:29] [INFO] switching to primary monitoring mode
[2020-04-24 23:15:29] [NOTICE] monitoring cluster primary "node2" (ID: 2)
The log shows that the standby was correctly promoted to primary and is now serving requests.
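
The timestamps in the log also show how long the standby waited before promoting itself. It made 6 attempts with a 10-second sleep between them, so the retry window is (6 - 1) × 10 = 50 seconds. These two numbers appear to correspond to the reconnect_attempts and reconnect_interval settings in repmgr.conf, which is an assumption here because the configuration above does not set them explicitly. The sketch below, with hypothetical helper names, checks that arithmetic against a few lines taken from the log.

```python
from datetime import datetime

# Four lines copied from the standby log above: the first and last retry.
LOG_EXCERPT = """\
[2020-04-24 23:14:38] [WARNING] unable to connect to upstream node "node1" (ID: 1)
[2020-04-24 23:14:38] [INFO] checking state of node 1, 1 of 6 attempts
[2020-04-24 23:15:28] [INFO] checking state of node 1, 6 of 6 attempts
[2020-04-24 23:15:28] [WARNING] unable to reconnect to node 1 after 6 attempts
"""


def log_time(line):
    """Extract the leading [YYYY-MM-DD HH:MM:SS] timestamp from a log line."""
    return datetime.strptime(line[1:20], "%Y-%m-%d %H:%M:%S")


def expected_window(attempts, interval_secs):
    """Sleeps happen between attempts, so N attempts span N - 1 intervals."""
    return (attempts - 1) * interval_secs


def observed_window(log_text):
    """Seconds between the first and last timestamped line of the excerpt."""
    lines = [ln for ln in log_text.splitlines() if ln.startswith("[")]
    return int((log_time(lines[-1]) - log_time(lines[0])).total_seconds())


if __name__ == "__main__":
    print(expected_window(6, 10), observed_window(LOG_EXCERPT))
```

This window covers only the retry phase. The time between the primary actually failing and the standby's first failed ping is not included.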
6. Check the cluster status
[postgres@localhost bin]$ ./repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-------------------------------------------------------------
 1  | node1 | primary | - failed  | ?        | default  | 100      |          | host=192.168.101.9 port=5432 user=postgres dbname=postgres
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=192.168.101.7 port=5432 user=postgres dbname=postgres