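When a COSBench test is terminated for no obvious reason, only intermittently, and the mission log gives no clue, take a look at the system log.
If you find the following call stack there, the failure was very likely caused by clock skew between the controller/drivers and the storage cluster.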
2020-08-13 02:19:28,977 [ERROR] [AbstractAgent] - unexpected exception
java.lang.ArrayIndexOutOfBoundsException: -9626
    at com.intel.cosbench.bench.Counter.doAdd(Counter.java:65)
    at com.intel.cosbench.driver.model.OperatorContext.doAddSample(OperatorContext.java:76)
    at com.intel.cosbench.driver.model.OperatorContext.addSample(OperatorContext.java:70)
    at com.intel.cosbench.driver.agent.WorkAgent.onSampleCreated(WorkAgent.java:211)
    at com.intel.cosbench.driver.operator.Preparer.operate(Preparer.java:99)
    at com.intel.cosbench.driver.operator.AbstractOperator.operate(AbstractOperator.java:76)
    at com.intel.cosbench.driver.agent.WorkAgent.performOperation(WorkAgent.java:197)
    at com.intel.cosbench.driver.agent.WorkAgent.doWork(WorkAgent.java:177)
    at com.intel.cosbench.driver.agent.WorkAgent.execute(WorkAgent.java:134)
    at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:44)
    at com.intel.cosbench.driver.agent.AbstractAgent.call(AbstractAgent.java:1)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-08-13 02:19:28,977 [ERROR] [MissionHandler] - detected workers [19, 20, 21, 22, 23, 24] have encountered errors
2020-08-13 02:19:28,979 [INFO] [MissionHandler] - mission M2E66EA747D has been terminated
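Because the failure is intermittent, it can help to check for this signature automatically after each run. The following is only a minimal sketch: the function name `has_counter_overflow` and the log path in the usage comment are assumptions for illustration, not part of COSBench.

```shell
# Return 0 if the given log file contains the Counter.doAdd
# ArrayIndexOutOfBoundsException signature, 1 otherwise.
# (Hypothetical helper; the name is not part of COSBench.)
has_counter_overflow() {
    log_file=$1
    # -F: match fixed strings, -q: no output, only the exit status matters.
    grep -qF 'java.lang.ArrayIndexOutOfBoundsException' "$log_file" &&
        grep -qF 'com.intel.cosbench.bench.Counter.doAdd' "$log_file"
}

# Usage (the log path is an assumption; adjust it to your installation):
#   has_counter_overflow log/system.log && echo "possible clock skew, check time sync"
```

Both strings are checked separately, so the helper works whether the stack trace is logged on one line or across several.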
Similarly, if the controller's system.log contains records like the following, the termination was probably caused by clock skew between the controller and the drivers.
2020-08-20 10:44:59,277 [WARN] [PingDriverRunner] - The driver driver1 at http://10.246.21.82:18088/driver is not reachable at the 1 time, with error message: Connection refused (Connection refused)
2020-08-20 17:47:37,351 [ERROR] [AbstractCommandTasklet] - driver report error: HTTP 400 - no such key defined: sizes
2020-08-20 17:47:37,359 [ERROR] [StageRunner] - detected tasks [t7, t8, t9, t10, t11, t12] have encountered errors
2020-08-20 17:47:37,366 [ERROR] [Aborter] - fail to abort driver
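To investigate further, run the following command so that the controller and the drivers report their local time in a single invocation, which makes any difference obvious at a glance. Otherwise it is hard to tell whether the gap between machines is real or just the few seconds spent typing the command on each one.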
# date && ssh root@10.246.21.82 date && ssh root@10.246.21.83 date
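The same comparison can be made numeric by asking each machine for its epoch time (`date +%s`) and printing the difference. This is a rough sketch under assumptions: the helper names `abs_diff` and `report_skew` are made up for illustration, and it assumes key-based ssh as root, as in the command above. The ssh round trip itself adds some delay, so a difference of a second or so is not meaningful.

```shell
# Absolute difference between two epoch timestamps, in seconds.
abs_diff() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}

# Print the clock difference between this machine and each given host.
# (Hypothetical helper; assumes non-interactive ssh login as root.)
report_skew() {
    for host in "$@"; do
        local_ts=$(date +%s)
        if remote_ts=$(ssh "root@$host" date +%s); then
            echo "$host: $(abs_diff "$local_ts" "$remote_ts")s"
        else
            echo "$host: unreachable"
        fi
    done
}

# Usage, run on the controller:
#   report_skew 10.246.21.82 10.246.21.83
```

Epoch seconds are independent of the time zone, so this separates a genuine clock offset from a mere time zone mismatch.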
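First, make sure the controller and the drivers are in the same time zone.
In this case the controller's time zone was UTC, so it should be changed to America/New_York to match the drivers.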
# timedatectl list-timezones | grep York
# timedatectl set-timezone America/New_York
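To confirm the change on every machine, the zone name can be pulled out of the `timedatectl` output. The sketch below assumes the usual `Time zone: <name> (<abbreviation>, <offset>)` line; the function name `current_timezone` is hypothetical.

```shell
# Read `timedatectl` output on stdin and print only the zone name,
# e.g. "America/New_York". (Hypothetical helper for illustration.)
current_timezone() {
    awk -F': ' '/Time zone:/ { split($2, parts, " "); print parts[1] }'
}

# Usage:
#   timedatectl | current_timezone
#   ssh root@10.246.21.82 timedatectl | current_timezone
```

Running it against the controller and each driver should print the same zone name on every line.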
Use the following steps to synchronize the time on CentOS 7.
First check the NTP status, then edit the NTP configuration file:
# vi /etc/ntp.conf
Add entries for the local NTP servers, for example these two lines:
server 172.16.199.1
server 10.254.140.22
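When the same entries have to be added on several drivers, editing each file by hand is error-prone. A possible sketch of an idempotent append follows; `add_ntp_server` is a hypothetical helper name, and the server addresses are the ones from this post.

```shell
# Append "server <addr>" to an ntp.conf-style file unless an entry
# for that address already exists. (Hypothetical helper.)
add_ntp_server() {
    conf=$1
    addr=$2
    # awk exits 0 only if a line "server <addr> ..." is already present.
    if ! awk -v a="$addr" '$1 == "server" && $2 == a { found = 1 } END { exit !found }' "$conf"; then
        printf 'server %s\n' "$addr" >> "$conf"
    fi
}

# Usage (as root):
#   add_ntp_server /etc/ntp.conf 172.16.199.1
#   add_ntp_server /etc/ntp.conf 10.254.140.22
```

Running it twice leaves a single entry per server, so it is safe to repeat.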
Check the status of the ntp service:
# systemctl status ntpd
Stop the ntp service:
# systemctl stop ntpd
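If the ntp service is still running, a forced sync against the server fails with the error "the NTP socket is in use, exiting".
Check the service status again to confirm that it has stopped, then force a sync with the NTP server: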
# ntpdate 10.254.140.22
or
# ntpd -gq
On success, the command reports the offset by which the clock was adjusted.
Then start the ntp service again:
# systemctl start ntpd.service
Check the NTP service status once more; the time should now be reported as synchronized.
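The stop, force-sync and start steps can be wrapped in one function so that they run in the same order on every machine. This is only a sketch: `run`, `resync_time` and the `DRY_RUN` variable are names invented here, and the real run needs root privileges. With `DRY_RUN=1` the commands are printed instead of executed, which is useful for checking the sequence first.

```shell
# Execute a command, or just print it when DRY_RUN=1.
# (Hypothetical helper; DRY_RUN is an assumption of this sketch.)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop ntpd, force a one-off sync against the given server, restart ntpd.
resync_time() {
    ntp_server=$1
    status=0
    # ntpdate cannot bind the NTP socket while ntpd is running.
    run systemctl stop ntpd || return 1
    run ntpdate "$ntp_server" || status=$?
    # Restart ntpd even if the sync failed, so the host keeps an NTP daemon.
    run systemctl start ntpd || status=$?
    return "$status"
}

# Usage:
#   DRY_RUN=1 resync_time 10.254.140.22   # print the commands only
#   resync_time 10.254.140.22             # run them (as root)
```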
References
==============
https://github.com/intel-cloud/cosbench/issues/264
https://www.thegeekdiary.com/centos-rhel-how-to-configure-ntp-server-and-client/
https://www.golinuxhub.com/2017/12/how-to-forcefully-sync-date-and-time/
https://www.thegeekdiary.com/centos-rhel-6-how-to-force-a-ntp-sync-with-the-ntp-servers/