zoukankan      html  css  js  c++  java
  • postgresql 高可用 repmgr 的使用之六 1 Primary + 1 Standby 的 auto failover

    os:ubunbu 16.04
    postgresql:9.6.8
    repmgr:4.1.1

    192.168.56.101 node1
    192.168.56.102 node2

    生产环境确保网络稳定的前提下,数据库能够 auto failover 是最好的状态。如果是同步复制的话,建议配置两个slave,如果是异步复制,也建议配置两个slave。

    需要实现 auto failover,就需要启用 repmgrd,而启用 repmgrd 就必须设置 postgresql.conf 的 shared_preload_libraries = ‘repmgr’

    $ vi /etc/postgresql/9.6/main/postgresql.conf
    shared_preload_libraries = 'repmgr'
    
    

    /etc/repmgr.conf 的内容

    node1 节点上查看,node2 节点上基本相同。

    # more /etc/repmgr.conf
    
    ##############################################
    # node information
    node_id=1
    node_name=node1
    conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/9.6/main'
    use_replication_slots=true
    pg_bindir='/usr/lib/postgresql/9.6/bin'
    pg_ctl_options='-l /var/log/postgres/startup.log'
    rsync_options=--exclude=postgresql.local.conf --archive
    #use_primary_conninfo_password=true
    location='location01'
    ##############################################
    # service_start_command
    service_start_command   = 'sudo pg_ctlcluster 9.6 main start'
    service_stop_command    = 'sudo pg_ctlcluster 9.6 main stop'
    service_restart_command = 'sudo pg_ctlcluster 9.6 main restart'
    service_reload_command  = 'sudo pg_ctlcluster 9.6 main reload' 
    service_promote_command = 'sudo pg_ctlcluster 9.6 main promote'
    ##############################################
    # log info
    log_level=INFO
    log_status_interval=10
    log_file=/var/log/postgresql/repmgrd.log
    ##############################################
    # repmgrd options
    monitoring_history=yes
    monitor_interval_secs=5
    reconnect_attempts=10
    reconnect_interval=5
    ##############################################
    # automatic failover
    failover=automatic
    promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
    follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
    
    

    启动 repmgrd

    每个节点都是需要启动的
    手动启动

    # repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
    

    系统启动

    # vi /etc/default/repmgrd 
    
    # default settings for repmgrd. This file is source by /bin/sh from
    # /etc/init.d/repmgrd
    
    # disable repmgrd by default so it won't get started upon installation
    # valid values: yes/no
    REPMGRD_ENABLED=yes
    
    # configuration file (required)
    REPMGRD_CONF="/etc/repmgr.conf"
    
    # additional options
    #REPMGRD_OPTS=""
    
    # user to run repmgrd as
    REPMGRD_USER=postgres
    
    # repmgrd binary
    REPMGRD_BIN=/usr/bin/repmgrd
    
    # pid file
    REPMGRD_PIDFILE=/var/run/repmgrd.pid
    

    REPMGRD_CONF 指向了配置文件
    非常有必要配置成系统启用。

    # systemctl status repmgrd.service
    # systemctl start repmgrd.service
    # systemctl enable repmgrd.service
    

    node1 上的 repmgrd 进程

    # ps -ef|grep -i repmgrd |grep -v grep
    postgres  4091     1  0 14:30 ?        00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf
    
    # cat /var/log/postgresql/repmgrd.log 
    [2018-09-25 14:30:28] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
    [2018-09-25 14:30:28] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
    [2018-09-25 14:30:28] [NOTICE] starting monitoring of node "node1" (ID: 1)
    [2018-09-25 14:30:28] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
    [2018-09-25 14:30:38] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 14:30:48] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    
    

    node2 上的 repmgrd 进程

    # ps -ef|grep -i repmgrd |grep -v grep
    postgres  4349     1  0 14:18 ?        00:00:00 /usr/lib/postgresql/9.6/bin/repmgrd --config-file /etc/repmgr.conf
    
    # cat /var/log/postgresql/repmgrd.log 
    [2018-09-25 14:33:44] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
    [2018-09-25 14:33:44] [INFO] connecting to database "host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2"
    [2018-09-25 14:33:44] [NOTICE] starting monitoring of node "node2" (ID: 2)
    [2018-09-25 14:33:44] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
    [2018-09-25 14:33:55] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 14:33:55] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 14:34:05] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 14:34:05] [DETAIL] last monitoring statistics update was 5 seconds ago
    

    手动关闭主库模拟异常

    node1 节点上操作

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
    ----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | * running |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | standby |   running | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
     
    $ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop
    或者
    $ sudo pg_ctlcluster 9.6 main stop
    
    $ tail -f /var/log/postgresql/repmgrd.log
    [2018-09-25 15:08:23] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
    [2018-09-25 15:08:23] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
    [2018-09-25 15:08:23] [NOTICE] starting monitoring of node "node1" (ID: 1)
    [2018-09-25 15:08:23] [NOTICE] monitoring cluster primary "node1" (node ID: 1)
    [2018-09-25 15:08:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:53] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:03] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:13] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:23] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:33] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:43] [INFO] monitoring primary node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:53] [WARNING] unable to connect to local node
    [2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
    [2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
    [2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
    [2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
    [2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
    [2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
    [2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
    [2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
    [2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
    [2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
    [2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
    [2018-09-25 15:10:38] [NOTICE] unable to connect to local node, falling back to degraded monitoring
    [2018-09-25 15:10:38] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
    [2018-09-25 15:10:38] [DETAIL] waiting for the node to become available
    [2018-09-25 15:10:48] [INFO] monitoring primary node "node1" (node ID: 1) in degraded state
    [2018-09-25 15:10:48] [DETAIL] waiting for the node to become available
    
    

    node2 上查看

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status        | Upstream | Location | Connection string                                            
    ----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | ? unreachable |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | standby |   running     | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - when attempting to connect to node "node1" (ID: 1), following error encountered :
    "could not connect to server: Connection refused
    	Is the server running on host "192.168.56.101" and accepting
    	TCP/IP connections on port 5432?"
      - node "node1" (ID: 1) is registered as an active primary but is unreachable
    
    $ tail -f /var/log/postgresql/repmgrd.log 
    
    [2018-09-25 15:07:58] [NOTICE] repmgrd (repmgrd 4.1.1) starting up
    [2018-09-25 15:07:58] [INFO] connecting to database "host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2"
    [2018-09-25 15:07:58] [NOTICE] starting monitoring of node "node2" (ID: 2)
    [2018-09-25 15:07:58] [INFO] monitoring connection to upstream node "node1" (node ID: 1)
    [2018-09-25 15:08:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:08] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:08:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:18] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:08:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:28] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:08:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:38] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:08:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:48] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:08:58] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:08:58] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:09:08] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:08] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:09:18] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:18] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:09:28] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:28] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:09:38] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:38] [DETAIL] last monitoring statistics update was 5 seconds ago
    [2018-09-25 15:09:48] [INFO] node "node2" (node ID: 2) monitoring upstream node "node1" (node ID: 1) in normal state
    [2018-09-25 15:09:48] [DETAIL] last monitoring statistics update was 5 seconds ago
    
    [2018-09-25 15:09:53] [WARNING] unable to connect to upstream node "node1" (node ID: 1)
    [2018-09-25 15:09:53] [INFO] checking state of node 1, 1 of 10 attempts
    [2018-09-25 15:09:53] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:09:58] [INFO] checking state of node 1, 2 of 10 attempts
    [2018-09-25 15:09:58] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:03] [INFO] checking state of node 1, 3 of 10 attempts
    [2018-09-25 15:10:03] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:08] [INFO] checking state of node 1, 4 of 10 attempts
    [2018-09-25 15:10:08] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:13] [INFO] checking state of node 1, 5 of 10 attempts
    [2018-09-25 15:10:13] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:18] [INFO] checking state of node 1, 6 of 10 attempts
    [2018-09-25 15:10:18] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:23] [INFO] checking state of node 1, 7 of 10 attempts
    [2018-09-25 15:10:23] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:28] [INFO] checking state of node 1, 8 of 10 attempts
    [2018-09-25 15:10:28] [INFO] sleeping 5 seconds until next reconnection attempt
    [2018-09-25 15:10:33] [INFO] checking state of node 1, 9 of 10 attempts
    [2018-09-25 15:10:33] [INFO] sleeping 5 seconds until next reconnection attempt
    
    [2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
    [2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
    [2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
    [2018-09-25 15:10:38] [INFO] promote_command is:
      "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
    [2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"
    
    [2018-09-25 15:10:38] [NOTICE] promoting standby to primary
    [2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
    [2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    [2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
    [2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
    [2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
    [2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
    [2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
    [2018-09-25 15:10:59] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
    [2018-09-25 15:11:09] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
    [2018-09-25 15:11:19] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
    
    

    在经过 10*5s的时常后,发生了 auto failover: promoting standby to primary,如下

    [2018-09-25 15:10:38] [INFO] checking state of node 1, 10 of 10 attempts
    [2018-09-25 15:10:38] [WARNING] unable to reconnect to node 1 after 10 attempts
    [2018-09-25 15:10:38] [NOTICE] this node is the only available candidate and will now promote itself
    [2018-09-25 15:10:38] [INFO] promote_command is:
      "/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file"
    [2018-09-25 15:10:38] [NOTICE] redirecting logging output to "/var/log/postgresql/repmgrd.log"
    
    [2018-09-25 15:10:38] [NOTICE] promoting standby to primary
    [2018-09-25 15:10:38] [DETAIL] promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
    [2018-09-25 15:10:38] [DETAIL] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    [2018-09-25 15:10:39] [NOTICE] STANDBY PROMOTE successful
    [2018-09-25 15:10:39] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
    [2018-09-25 15:10:39] [INFO] switching to primary monitoring mode
    [2018-09-25 15:10:39] [NOTICE] monitoring cluster primary "node2" (node ID: 2)
    [2018-09-25 15:10:49] [INFO] monitoring primary node "node2" (node ID: 2) in normal state
    

    node2 上查看 cluster show

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
    ----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | - failed  |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - when attempting to connect to node "node1" (ID: 1), following error encountered :
    "could not connect to server: Connection refused
    	Is the server running on host "192.168.56.101" and accepting
    	TCP/IP connections on port 5432?"
    	
    

    虚拟机掉电模拟异常

    经过前面的测试,node2 现在是 master,node1 通过 node rejoin后变为slave,细节可以参考上一篇blog。
    node2 节点掉电前状态查看

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
    ----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
     1  | node1 | standby |   running | node2    | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    

    node2 开始掉电

    node1 上查看

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status        | Upstream | Location | Connection string                                            
    ----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
     1  | node1 | standby |   running     | node2    | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | primary | ? unreachable |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - when attempting to connect to node "node2" (ID: 2), following error encountered :
    "could not connect to server: No route to host
    	Is the server running on host "192.168.56.102" and accepting
    	TCP/IP connections on port 5432?"
      - node "node2" (ID: 2) is registered as an active primary but is unreachable
      
    

    此时 node1 节点的也提升为 master,提升为master的日志输出基本类似,不再重复粘贴。

    参考:
    https://www.2ndquadrant.com/en/resources/repmgr/
    https://github.com/2ndQuadrant/repmgr
    https://repmgr.org/docs/4.1/using-repmgrd.html

  • 相关阅读:
    js 获取asp.net TextBox值
    项目笔记一
    如何取GridView中隐藏列的值
    使用bablepolyfill
    border: 0.5px
    选择查询固定位置的数据
    SQL用户数据库权限设置
    Java.1 HelloWorld的分析
    Java.2 面向对象
    【drupal实践】windows下drupal7.22develop版本安装
  • 原文地址:https://www.cnblogs.com/ctypyb2002/p/9792867.html
Copyright © 2011-2022 走看看