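Our environment runs a Cassandra cluster of 6 nodes across 2 datacenters, version 2.1.9.
Today we restarted one machine in the cluster, 10.168.12.3, and found that the node could not rejoin the cluster. The analysis follows.
Checking the cluster status from the restarted node, everything looked normal.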
$ nodetool status
Datacenter: DC-SGM-DR
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns  Host ID                               Rack
UN  10.168.50.205  822.91 MB  256     ?     bea84e24-76c8-4070-9c41-d0051d8aba63  RAC-1B
UN  10.168.50.212  825.43 MB  256     ?     97e92d11-028a-44f6-b6ea-be3992985506  RAC-1B
UN  10.168.50.213  14.37 GB   256     ?     de47960c-54ab-4ed3-99e7-e3abcb66c014  RAC-1B
Datacenter: DC-SGM-SH
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns  Host ID                               Rack
UN  10.168.11.11   10.17 GB   256     ?     9d016b9f-5655-4899-8652-607bdc24eda3  RAC-1A
UN  10.168.12.3    831.42 MB  256     ?     57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f  RAC-1A
UN  10.168.11.6    828.2 MB   256     ?     9cf69121-4dbc-419c-b3a8-e166d83b4177  RAC-1A

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
We then logged in to the other nodes of the cluster and checked the cluster status from there.
$ nodetool status
Datacenter: DC-SGM-DR
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns  Host ID                               Rack
UN  10.168.50.205  828.16 MB  256     ?     bea84e24-76c8-4070-9c41-d0051d8aba63  RAC-1B
UN  10.168.50.212  825.43 MB  256     ?     97e92d11-028a-44f6-b6ea-be3992985506  RAC-1B
UN  10.168.50.213  14.37 GB   256     ?     de47960c-54ab-4ed3-99e7-e3abcb66c014  RAC-1B
Datacenter: DC-SGM-SH
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns  Host ID                               Rack
UN  10.168.11.11   834.48 MB  256     ?     9d016b9f-5655-4899-8652-607bdc24eda3  RAC-1A
DN  10.168.12.3    831.31 MB  256     ?     57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f  RAC-1A
UN  10.168.11.6    828.17 MB  256     ?     9cf69121-4dbc-419c-b3a8-e166d83b4177  RAC-1A

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
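The disagreement between the two views (the restarted node reports itself as UN while its peers report it as DN) can also be detected with a script. The following is only a minimal Python sketch and was not part of our original troubleshooting. It assumes the `nodetool status` output of each node has been captured as text in nodetool's usual one-row-per-node layout; the helper names `parse_status` and `diff_views` are hypothetical, and the sample rows are abridged from the two outputs above.

```python
import re

# Matches node rows such as "UN  10.168.50.205  822.91 MB  256 ...".
# First letter: U(p)/D(own); second: N(ormal)/L(eaving)/J(oining)/M(oving).
ROW = re.compile(r"^([UD][NLJM])\s+(\d{1,3}(?:\.\d{1,3}){3})\s", re.MULTILINE)


def parse_status(text):
    """Return {address: status} for every node row in nodetool status output."""
    return {addr: status for status, addr in ROW.findall(text)}


def diff_views(view_a, view_b):
    """Return {address: (status_a, status_b)} where the two views disagree."""
    a, b = parse_status(view_a), parse_status(view_b)
    return {
        addr: (a[addr], b[addr])
        for addr in a.keys() & b.keys()
        if a[addr] != b[addr]
    }


# Abridged rows as seen from the restarted node ...
restarted_view = """\
UN  10.168.11.11  10.17 GB   256  ?  9d016b9f-5655-4899-8652-607bdc24eda3  RAC-1A
UN  10.168.12.3   831.42 MB  256  ?  57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f  RAC-1A
"""
# ... and as seen from one of its peers.
peer_view = """\
UN  10.168.11.11  834.48 MB  256  ?  9d016b9f-5655-4899-8652-607bdc24eda3  RAC-1A
DN  10.168.12.3   831.31 MB  256  ?  57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f  RAC-1A
"""

print(diff_views(restarted_view, peer_view))
```

A non-empty result means at least one node is seen differently by the two sides, which is exactly the symptom described here.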
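The other nodes of the cluster show the restarted node in the "DN" state, and the Cassandra system.log file on each of those nodes reports the following warnings: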
..................................................
WARN [GossipStage:1] 2020-01-02 10:07:45,831 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:47,680 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:49,682 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:50,690 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:50,833 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:51,681 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:51,833 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:52,833 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:54,684 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:55,683 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:55,834 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:57,683 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:07:58,684 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:00,684 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:01,688 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:05,686 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:06,686 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:08,838 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:09,839 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:11,688 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:11,839 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:12,840 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:13,688 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:17,841 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:20,690 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:21,691 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:21,843 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:22,691 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
WARN [GossipStage:1] 2020-01-02 10:08:22,843 Gossiper.java:1105 - received an invalid gossip generation for peer /10.168.12.3; local generation = 1527840276, received generation = 1577928397
..................................................
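Since the same warning repeats every second or two, it helps to collapse the log into the distinct generation pairs it mentions. The snippet below is a rough Python sketch for doing so; the regular expression is derived only from the warning text quoted above, and the function name `summarize` is hypothetical.

```python
import re

# Pattern for the Gossiper warning quoted above.
WARNING = re.compile(
    r"received an invalid gossip generation for peer /(?P<peer>[\d.]+); "
    r"local generation = (?P<local>\d+), received generation = (?P<received>\d+)"
)


def summarize(log_text):
    """Return the set of distinct (peer, local, received) triples in the log."""
    return {
        (m["peer"], int(m["local"]), int(m["received"]))
        for m in WARNING.finditer(log_text)
    }


# Two of the warnings from system.log, used here as sample input.
sample = (
    "WARN [GossipStage:1] 2020-01-02 10:07:45,831 Gossiper.java:1105 - "
    "received an invalid gossip generation for peer /10.168.12.3; "
    "local generation = 1527840276, received generation = 1577928397\n"
    "WARN [GossipStage:1] 2020-01-02 10:07:47,680 Gossiper.java:1105 - "
    "received an invalid gossip generation for peer /10.168.12.3; "
    "local generation = 1527840276, received generation = 1577928397\n"
)

for peer, local, received in sorted(summarize(sample)):
    print(peer, local, received, "gap (s):", received - local)
```

For the excerpt above this reduces to a single pair: every warning concerns peer 10.168.12.3, with the same local and received generation each time.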
We logged in to the restarted Cassandra node and checked gossipinfo:
$ nodetool gossipinfo
.................................
/10.168.12.3
  generation:1527840276
  heartbeat:22488596
  HOST_ID:57c4d98b-c52c-48bf-b8ee-7d8f22bcc08f
  SCHEMA:54b29ca7-5a9c-345b-be73-437504faf71b
  SEVERITY:0.0
  NET_VERSION:8
  RACK:RAC-1A
  DC:DC-SGM-SH
  RELEASE_VERSION:2.1.9
  STATUS:NORMAL,-101651619030947983
  RPC_ADDRESS:10.168.12.3
  LOAD:8.72963151E8
.................................
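As shown, the other nodes record the restarted node's generation as epoch 1527840276, which converts to Friday, June 1, 2018, 08:04 AM (UTC). That is the time at which we had started Cassandra. Next we logged in to the restarted node and queried the system.local table: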
$ cqlsh `hostname` -u cassandra
cassandra@cqlsh> use system;
cassandra@cqlsh:system> select key , gossip_generation from local ;

 key   | gossip_generation
-------+-------------------
 local |        1577928397

(1 rows)
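Both generation values are Unix epoch seconds, so they can be converted to readable timestamps directly. Here is a small Python sketch of that conversion; the helper name `to_utc` and the variable names are our own, and the times are expressed in UTC.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


peer_view = 1527840276   # generation the other nodes hold for 10.168.12.3
local_view = 1577928397  # gossip_generation in system.local after the restart

print(to_utc(peer_view).isoformat())   # 2018-06-01T08:04:36+00:00
print(to_utc(local_view).isoformat())  # 2020-01-02T01:26:37+00:00
print((to_utc(local_view) - to_utc(peer_view)).days, "days apart")  # 579 days apart
```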
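Converting 1577928397 gives Thursday, January 2, 2020, 01:26 AM (UTC). The two timestamps are about a year and a half apart; in other words, the previous start of Cassandra on this node was still the one on Friday, June 1, 2018, 08:04 AM. This restart in fact triggered a Cassandra bug: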
https://issues.apache.org/jira/browse/CASSANDRA-10969
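As far as we understand the ticket, the affected versions reject a gossiped generation when it is more than roughly one year ahead of the generation a node already holds for that peer, and log the warning seen above. The Python sketch below only imitates that comparison to show why our values trip it. It is not the Cassandra source: the one-year threshold and the name `MAX_GENERATION_DIFFERENCE` are our assumption and should be verified against the ticket, and `is_generation_accepted` is a hypothetical helper.

```python
# Assumed threshold: one year in seconds (verify against the ticket).
MAX_GENERATION_DIFFERENCE = 86400 * 365


def is_generation_accepted(local_generation, received_generation,
                           max_difference=MAX_GENERATION_DIFFERENCE):
    """Imitate the check: reject a generation too far ahead of the known one."""
    if local_generation == 0:
        # Assumption: with no generation recorded for the peer, nothing is rejected.
        return True
    return received_generation <= local_generation + max_difference


local_generation = 1527840276     # what the other nodes remember
received_generation = 1577928397  # what the restarted node now gossips

gap = received_generation - local_generation
print(gap, gap > MAX_GENERATION_DIFFERENCE)  # 50088121 True
print(is_generation_accepted(local_generation, received_generation))  # False
```

Under these assumptions, any node that has been up for more than a year would hit the same rejection on its next restart, which matches the roughly 580-day uptime we observed.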
For more background, see the blog post an expert has written on this issue.
To resolve it, we restarted the nodes of the cluster one after another.