  • PostgreSQL high availability with repmgr, part 5: manual failover and node rejoin with 1 primary + 1 standby

    OS: Ubuntu 16.04
    PostgreSQL: 9.6.8
    repmgr: 4.1.1

    192.168.56.101 node1
    192.168.56.102 node2

    Contents of /etc/repmgr.conf before the procedure

    This is the file on node1; the file on node2 is similar.

    $ cat /etc/repmgr.conf 
    
    node_id=1
    node_name=node1
    conninfo='host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2'
    data_directory='/var/lib/postgresql/9.6/main'
    use_replication_slots=true
    pg_bindir='/usr/lib/postgresql/9.6/bin'
    service_start_command   = 'sudo pg_ctlcluster 9.6 main start'
    service_stop_command    = 'sudo pg_ctlcluster 9.6 main stop'
    service_restart_command = 'sudo pg_ctlcluster 9.6 main restart'
    service_reload_command  = 'sudo pg_ctlcluster 9.6 main reload' 
    service_promote_command  = 'sudo pg_ctlcluster 9.6 main promote'
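
    Only node1's file is shown here. As a minimal sketch, assuming node2's file differs only in node_id, node_name and the host in conninfo (an assumption, not confirmed by the original), the hypothetical helper node2_conf_from_node1 derives it from node1's file with sed:

```shell
# Hypothetical helper: rewrite node1's repmgr.conf (read from stdin) for node2.
# Assumes only node_id, node_name and the conninfo host differ between nodes.
node2_conf_from_node1() {
  sed -e 's/^node_id=1$/node_id=2/' \
      -e 's/^node_name=node1$/node_name=node2/' \
      -e '/^conninfo=/s/host=192\.168\.56\.101/host=192.168.56.102/'
}

# Usage (review the result before installing it on node2):
#   node2_conf_from_node1 < /etc/repmgr.conf > /tmp/repmgr.node2.conf
```

    All other settings (data_directory, pg_bindir, the service_* commands) pass through unchanged, which only holds if both nodes use the same Debian/Ubuntu cluster layout.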
    

    Shut down the primary manually to simulate a failure

    On node1:

    $ pg_ctl -D /var/lib/postgresql/9.6/main -m fast stop
    or
    $ sudo pg_ctlcluster 9.6 main stop
    
    $ repmgr -f /etc/repmgr.conf cluster show
    ERROR: connection to database failed:
      could not connect to server: Connection refused
    	Is the server running on host "192.168.56.101" and accepting
    	TCP/IP connections on port 5432?
    
    DETAIL: attempted to connect using:
      user=repmgr connect_timeout=2 dbname=repmgr host=192.168.56.101 fallback_application_name=repmgr
      
    

    On node2:

    $ repmgr -f /etc/repmgr.conf cluster show
    
     ID | Name  | Role    | Status        | Upstream | Location | Connection string                                              
    ----+-------+---------+---------------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | ? unreachable |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | standby |   running     | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - when attempting to connect to node "node1" (ID: 1), following error encountered :
    "could not connect to server: Connection refused
    	Is the server running on host "192.168.56.101" and accepting
    	TCP/IP connections on port 5432?"
      - node "node1" (ID: 1) is registered as an active primary but is unreachable
     
    

    The Status column shows node1 as unreachable.
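
    The same check can be scripted. The sketch below parses the table layout printed above and lists nodes whose status contains "unreachable" or "failed"; the helper name down_nodes is hypothetical, and the parsing assumes the column order shown in this output:

```shell
# Print the names of nodes that `repmgr cluster show` (read from stdin)
# reports as unreachable or failed. Relies on the column order shown above:
# ID | Name | Role | Status | Upstream | Location | Connection string
down_nodes() {
  awk -F'|' '
    NF >= 7 && $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      name = $2; status = $4
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
      if (status ~ /unreachable|failed/) print name
    }'
}

# Usage on node2:
#   repmgr -f /etc/repmgr.conf cluster show 2>/dev/null | down_nodes
```

    Header, separator and WARNING lines are skipped because their first field is not a numeric node ID.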

    Promote the standby to primary

    PostgreSQL on node1 is now unavailable (manual shutdown, crashed process, or host failure), so the standby on node2 must be promoted to primary.
    On node2:

    $ repmgr -f /etc/repmgr.conf standby promote
    
    NOTICE: promoting standby to primary
    DETAIL: promoting server "node2" (ID: 2) using "sudo pg_ctlcluster 9.6 main promote"
    DETAIL: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
    NOTICE: STANDBY PROMOTE successful
    DETAIL: server "node2" (ID: 2) was successfully promoted to primary
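
    The promotion can also be confirmed independently of repmgr by asking PostgreSQL directly: pg_is_in_recovery() returns false on a primary and true on a standby. A minimal sketch follows; the helper name is hypothetical, and the psql connection options in the usage line are assumed to match the conninfo above:

```shell
# Map the output of `psql -Atc 'SELECT pg_is_in_recovery()'` to a role name.
# psql prints booleans as "t" or "f" in unaligned, tuples-only mode.
role_from_recovery_flag() {
  case "$1" in
    f) echo primary ;;
    t) echo standby ;;
    *) echo unknown; return 1 ;;
  esac
}

# Usage on node2 (assumes the repmgr user can connect without a password prompt):
#   role_from_recovery_flag "$(psql -h 192.168.56.102 -U repmgr -d repmgr \
#       -Atc 'SELECT pg_is_in_recovery()')"
```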
    
    

    Check the cluster status again on node2:

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string                                              
    ----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | - failed  |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - when attempting to connect to node "node1" (ID: 1), following error encountered :
    "could not connect to server: Connection refused
    	Is the server running on host "192.168.56.101" and accepting
    	TCP/IP connections on port 5432?"
    

    Make node1 the new standby

    On node1, start PostgreSQL:

    # /etc/init.d/postgresql start
    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status               | Upstream | Location | Connection string                                              
    ----+-------+---------+----------------------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | * running            |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | standby | ! running as primary | node1    | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - node "node2" (ID: 2) is registered as standby but running as primary
      
    

    On node2:

    $ repmgr -f /etc/repmgr.conf cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Connection string                                            
    ----+-------+---------+-----------+----------+----------+-----------------------------------------------------------------
     1  | node1 | primary | ! running |          | default  | host=192.168.56.101 user=repmgr dbname=repmgr connect_timeout=2
     2  | node2 | primary | * running |          | default  | host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2
    
    WARNING: following issues were detected
      - node "node1" (ID: 1) is running but the repmgr node record is inactive
    

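    Here is the problem: cluster show now reports a WARNING on both node1 and node2, because node1 came back up still running as a primary. node1's PostgreSQL has to be pointed at the new primary.

    Stop PostgreSQL on node1 (node rejoin operates on an instance that has been shut down cleanly):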
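
    Two instances running as primary at the same time is the situation to catch early, since both accept writes. As a rough sketch (the helper name running_primaries is hypothetical, and the parsing assumes the table layout shown above), the number of running primaries can be counted from the cluster show output:

```shell
# Count nodes that `repmgr cluster show` (read from stdin) reports as a
# running primary, including standbys flagged "running as primary".
running_primaries() {
  awk -F'|' '
    NF >= 7 && $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
      if (($3 ~ /primary/ && $4 ~ /running/) || $4 ~ /running as primary/) n++
    }
    END { print n + 0 }'
}

# Usage on either node; anything other than 1 deserves a closer look:
#   repmgr -f /etc/repmgr.conf cluster show 2>/dev/null | running_primaries
```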

    $ sudo pg_ctlcluster 9.6 main stop
    

    Use repmgr node rejoin to add node1 back to the cluster; it can optionally use pg_rewind.
    (This can optionally use pg_rewind to re-integrate a node which has diverged from the rest of the cluster, typically a failed primary.)
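
    pg_rewind can only be used if the target cluster ran with wal_log_hints = on or was initialized with data checksums. The dry run below reports this as "prerequisites for using pg_rewind are met"; as a minimal sketch, the same condition can be read from pg_controldata while node1 is stopped. The helper name is hypothetical, and the parsing assumes the English (LC_ALL=C) label text:

```shell
# Succeed if pg_controldata output (read from stdin) shows that either
# wal_log_hints is on or data page checksums are enabled.
rewind_prereqs_met() {
  awk -F': *' '
    /^wal_log_hints setting:/      { if ($2 == "on") ok = 1 }
    /^Data page checksum version:/ { if ($2 + 0 > 0) ok = 1 }
    END { exit (ok ? 0 : 1) }'
}

# Usage on node1 while PostgreSQL is stopped:
#   LC_ALL=C /usr/lib/postgresql/9.6/bin/pg_controldata /var/lib/postgresql/9.6/main \
#     | rewind_prereqs_met && echo "pg_rewind can be used"
```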

    $ repmgr -f /etc/repmgr.conf node rejoin -d 'host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --dry-run --verbose
    
    NOTICE: using provided configuration file "/etc/repmgr.conf"
    INFO: prerequisites for using pg_rewind are met
    INFO: 0 files would have been copied to "/tmp/repmgr-config-archive-pgsql96"
    INFO: temporary archive directory "/tmp/repmgr-config-archive-pgsql96" deleted
    INFO: pg_rewind would now be executed
    DETAIL: pg_rewind command is:
      /usr/lib/postgresql/9.6/bin/pg_rewind -D '/var/lib/postgresql/9.6/main' --source-server='host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2'
    INFO: prerequisites for executing NODE REJOIN are met
    
    $ repmgr -f /etc/repmgr.conf node rejoin -d 'host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2' --force-rewind --verbose
    
    NOTICE: using provided configuration file "/etc/repmgr.conf"
    INFO: prerequisites for using pg_rewind are met
    INFO: 0 files copied to "/tmp/repmgr-config-archive-pgsql96"
    NOTICE: executing pg_rewind
    NOTICE: 0 files copied to /var/lib/postgresql/9.6/main
    INFO: directory "/tmp/repmgr-config-archive-pgsql96" deleted
    INFO: deleting "recovery.done"
    NOTICE: setting node 1's primary to node 2
    NOTICE: starting server using "sudo pg_ctlcluster 9.6 main start"
    INFO: demoted primary is pingable
    INFO: node 1 has attached to its upstream node
    NOTICE: NODE REJOIN successful
    DETAIL: node 1 is now attached to node 2
    

    This matches the expected result.

    References:
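
    The two rejoin commands above can be chained so that the real rejoin runs only when the dry run passes. This is a sketch, not part of repmgr: the wrapper name and the REPMGR override variable are hypothetical, while the repmgr options are the ones used above.

```shell
# Command used to invoke repmgr; override it to try the wrapper without repmgr.
REPMGR=${REPMGR:-repmgr}

# Hypothetical wrapper: run the dry run first and rejoin only if it passes.
# $1 is the conninfo of the current primary (node2 in this walkthrough).
rejoin_after_dry_run() {
  primary_conninfo=$1
  if ! "$REPMGR" -f /etc/repmgr.conf node rejoin -d "$primary_conninfo" \
        --force-rewind --dry-run; then
    echo "dry run failed; node rejoin not attempted" >&2
    return 1
  fi
  "$REPMGR" -f /etc/repmgr.conf node rejoin -d "$primary_conninfo" --force-rewind
}

# Usage on node1, with PostgreSQL stopped:
#   rejoin_after_dry_run 'host=192.168.56.102 user=repmgr dbname=repmgr connect_timeout=2'
```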
    https://www.2ndquadrant.com/en/resources/repmgr/
    https://github.com/2ndQuadrant/repmgr
    https://repmgr.org/docs/4.1/repmgr-administration-manual.html
    https://repmgr.org/docs/4.1/repmgr-node-rejoin.html

  • Original article: https://www.cnblogs.com/ctypyb2002/p/9792868.html