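PostgreSQL uses Streaming Replication to implement hot standby. A hot standby serves the following purposes:
- Disaster recovery
- High availability
- Load balancing: with a hot standby built on Streaming Replication you can run queries on the standby, but only read-only ones (SELECT).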
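Once a large number of machines rely on streaming replication, monitoring that Streaming Replication is running properly becomes a very important part of the deployment.
This raises the following monitoring questions:
- How can Streaming Replication be monitored better?
- What is the best way to monitor it?
- Besides the pg_stat_replication view on the master, what methods are available on the standby for monitoring streaming replication?
- How can replication lag be calculated in seconds or minutes?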
For these common questions, here are some approaches that I have found useful.
- The pg_stat_replication view on the master/primary server. Its columns are:
  - pid: process ID of the walsender process
  - usesysid: OID of the user performing streaming replication
  - usename: name of the user performing streaming replication
  - application_name: application name connected to the master
  - client_addr: IP address of the standby
  - client_hostname: hostname of the standby
  - client_port: TCP port on the standby
  - backend_start: time at which the standby connected to the master
  - state: current WAL sender state, e.g. streaming
  - sent_location: last transaction log location sent to the standby
  - write_location: last transaction log location written to disk on the standby
  - flush_location: last transaction log location flushed to disk on the standby
  - replay_location: last transaction log location replayed on the standby
  - sync_priority: priority of the standby server
  - sync_state: synchronization type of the standby (asynchronous or synchronous)

  For example:

      postgres=# \x
      Expanded display is on.
      postgres=# select * from pg_stat_replication;
      -[ RECORD 1 ]----+------------------------------
      pid              | 19597
      usesysid         | 16384
      usename          | repl
      application_name | walreceiver
      client_addr      | 210.61.161.183
      client_hostname  |
      client_port      | 50474
      backend_start    | 2015-02-04 11:07:27.137356+08
      state            | streaming
      sent_location    | 4/E059E560
      write_location   | 4/E059E560
      flush_location   | 4/E059E560
      replay_location  | 4/E059BEB0
      sync_priority    | 0
      sync_state       | async
- select pg_is_in_recovery(); Run on the standby, this function tells you whether the server is in recovery mode. It returns t on a standby that is replicating and f otherwise. On a standby:

      postgres=# select pg_is_in_recovery();
       pg_is_in_recovery
      -------------------
       t
      (1 row)

  On a server that is not a standby:

      postgres=# select pg_is_in_recovery();
       pg_is_in_recovery
      -------------------
       f
      (1 row)
- select pg_last_xlog_replay_location(); Also run on the standby, this shows the location of the last transaction log record replayed during recovery. For example:

      postgres=# select pg_last_xlog_replay_location();
       pg_last_xlog_replay_location
      ------------------------------
       0/27099838
      (1 row)
- select pg_last_xlog_receive_location(); Run on the standby, this shows the last transaction log location that the standby has received and synced to disk. For example:

      postgres=# select pg_last_xlog_receive_location();
       pg_last_xlog_receive_location
      -------------------------------
       0/2709CB70
      (1 row)
- select pg_last_xact_replay_timestamp(); Run on the standby, this shows the timestamp of the last transaction replayed during recovery. For example:

      postgres=# select pg_last_xact_replay_timestamp();
       pg_last_xact_replay_timestamp
      -------------------------------
       2015-02-09 19:48:57.916245+08
      (1 row)
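The locations returned by the functions above are text values of the form high/low, where both parts are hexadecimal. To compare two of them outside the database, they can be converted to integers. The following is a minimal sketch, not part of PostgreSQL: the helper names are hypothetical, and it assumes the linear 64-bit location format used since PostgreSQL 9.3 (older releases skipped the last segment of each log file, so differences that span different high parts would not match there). It also treats the two sample outputs above as if they came from a single snapshot.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a location such as '0/2709CB70' to an absolute byte position.

    Assumption: the high part counts units of 2**32 bytes (PostgreSQL 9.3+).
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def int_to_lsn(position: int) -> str:
    """Format a byte position back into the high/low hexadecimal text form."""
    return f"{position >> 32:X}/{position & 0xFFFFFFFF:X}"


def unreplayed_bytes(receive_location: str, replay_location: str) -> int:
    """Bytes the standby has received but not yet replayed."""
    return lsn_to_int(receive_location) - lsn_to_int(replay_location)


# Sample values copied from the example outputs above.
print(unreplayed_bytes("0/2709CB70", "0/27099838"))  # 13112
```

A value of 0 means the standby has replayed everything it has received so far.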
The next question is how to monitor Streaming Replication correctly on both the master and the standby.
Monitoring on the standby:
- select pg_is_in_recovery(); to check whether the server is in recovery mode.
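- Check how far recovery is lagging. If the receive and replay locations are the same, the delay is 0; otherwise the delay is the current time minus the timestamp of the last replayed transaction, in seconds:

      SELECT CASE
               WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
               ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
             END AS log_delay;

       log_delay
      -----------
               0
      (1 row)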
- Use pg_last_xact_replay_timestamp and pg_last_xlog_replay_location to check whether recovery is actually making progress. While Streaming Replication is replicating, both the replay timestamp and the replay location keep increasing:

      postgres=# select pg_last_xact_replay_timestamp();
       pg_last_xact_replay_timestamp
      -------------------------------
       2015-02-09 20:53:54.48081+08
      (1 row)

      postgres=# select pg_last_xact_replay_timestamp();
       pg_last_xact_replay_timestamp
      -------------------------------
       2015-02-09 20:53:55.456179+08
      (1 row)

      postgres=# select pg_last_xlog_replay_location();
       pg_last_xlog_replay_location
      ------------------------------
       5/723E528
      (1 row)

      postgres=# select pg_last_xlog_replay_location();
       pg_last_xlog_replay_location
      ------------------------------
       5/72514B8
      (1 row)
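The two standby-side checks above can also be expressed in a monitoring script once the values have been fetched. The sketch below mirrors the logic of the log_delay query and the "is the timestamp still increasing" check. The function names are hypothetical, the inputs are assumed to have been read from the standby already, and a missing replay timestamp (the function returns NULL when no transaction has been replayed yet) is passed through as None.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def log_delay_seconds(receive_location: str,
                      replay_location: str,
                      now: datetime,
                      last_replay_ts: Optional[datetime]) -> Optional[float]:
    """Mirror the CASE expression: 0 when everything received has been
    replayed, otherwise the age of the last replayed transaction."""
    if receive_location == replay_location:
        return 0.0
    if last_replay_ts is None:
        # Nothing has been replayed yet, so the delay is unknown.
        return None
    return (now - last_replay_ts).total_seconds()


def replay_is_advancing(earlier_ts: datetime, later_ts: datetime) -> bool:
    """True when the replay timestamp grew between two samples."""
    return later_ts > earlier_ts


# The two timestamps sampled in the example session above (+08 offset).
tz = timezone(timedelta(hours=8))
first = datetime(2015, 2, 9, 20, 53, 54, 480810, tzinfo=tz)
second = datetime(2015, 2, 9, 20, 53, 55, 456179, tzinfo=tz)

print(replay_is_advancing(first, second))  # True
print(log_delay_seconds("5/72514B8", "5/723E528",
                        now=second + timedelta(seconds=3),
                        last_replay_ts=second))  # 3.0
```

Note that a replay timestamp that stops increasing does not by itself prove a failure: on an idle master there are no new transactions to replay, which is why the delay is reported as 0 when the receive and replay locations match.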
Monitoring on the master:
- Check the status reported by pg_stat_replication:

      postgres=# select * from pg_stat_replication;
      -[ RECORD 1 ]----+------------------------------
      pid              | 19597
      usesysid         | 16384
      usename          | repl
      application_name | walreceiver
      client_addr      | 210.61.161.183
      client_hostname  |
      client_port      | 50474
      backend_start    | 2015-02-04 11:07:27.137356+08
      state            | streaming
      sent_location    | 5/64046A8
      write_location   | 5/64046A8
      flush_location   | 5/64046A8
      replay_location  | 5/64027B0
      sync_priority    | 0
      sync_state       | async
- Measure on the master how far the standby's recovery is lagging, in bytes:

      postgres=# select pg_xlog_location_diff(sent_location, replay_location) from pg_stat_replication;
       pg_xlog_location_diff
      -----------------------
                        1968
      (1 row)

      postgres=# select pg_xlog_location_diff(sent_location, replay_location) from pg_stat_replication;
       pg_xlog_location_diff
      -----------------------
                        1488
      (1 row)
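The same byte arithmetic can be applied to every location column of a pg_stat_replication row at once, which shows at which stage (write, flush or replay) the standby is falling behind. This is only an illustrative sketch: the helper names and the 1 MB threshold are assumptions, the row is passed in as a plain dict rather than fetched from the server, and the subtraction assumes the linear 64-bit location format of PostgreSQL 9.3 and later.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'high/low' hexadecimal location text to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def stage_lag_bytes(row: dict) -> dict:
    """Bytes by which each standby stage trails sent_location."""
    sent = lsn_to_int(row["sent_location"])
    return {
        stage: sent - lsn_to_int(row[stage + "_location"])
        for stage in ("write", "flush", "replay")
    }


def lagging_stages(row: dict, max_bytes: int) -> list:
    """Stages whose lag exceeds max_bytes (a hypothetical alert threshold)."""
    return [stage for stage, lag in stage_lag_bytes(row).items()
            if lag > max_bytes]


# Location columns copied from the pg_stat_replication row shown above.
row = {
    "sent_location": "5/64046A8",
    "write_location": "5/64046A8",
    "flush_location": "5/64046A8",
    "replay_location": "5/64027B0",
}

print(stage_lag_bytes(row))            # {'write': 0, 'flush': 0, 'replay': 7928}
print(lagging_stages(row, 1_000_000))  # []
```

The replay figure differs from the 1968 and 1488 shown in the session above because those queries were run at different moments than the row snapshot.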
References:
- PostgreSQL documentation: The Statistics Collector
- PostgreSQL documentation: System Administration Functions