  • Creating a Linux service with systemd: startup ordering and restart on failure

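    Background

    Our team designed a LoRa gateway based on Armbian. At power-on it must start the main program, packet_forwarder, which relays traffic between LoRa and UDP to communicate with the server.
    In principle this is a simple requirement: wrap the program in a service and load it into systemd. The rime_gateway.service unit file is as follows: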

    [Unit]
    Description=Rime LoRaWAN Gateway
    
    [Service]
    WorkingDirectory=/home/rime/packet_forwarder/lora_pkt_fwd
    ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh
    Restart=always
    
    [Install]
    WantedBy=multi-user.target
    

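    For an explanation of the syntax, see the tutorial "Systemd 入门教程:命令篇" (a systemd introduction covering commands).

    An unstable service

    When started manually with systemctl start rime_gateway.service, the service works fine.

    However, after Armbian boots at power-on, systemctl status rime_gateway.service shows that the service has stopped: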

    rime_gateway.service - Rime LoRaWAN Gateway
       Loaded: loaded (/lib/systemd/system/rime_gateway.service; enabled; vendor preset: enabled)
       Active: failed (Result: exit-code) since Mon 2020-04-20 06:51:46 UTC; 29s ago
      Process: 1112 ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh (code=exited, status=1/FAILURE)
     Main PID: 1112 (code=exited, status=1/FAILURE)
    
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 5.
    Apr 20 06:51:46 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Start request repeated too quickly.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
    Apr 20 06:51:46 orangepizero systemd[1]: Failed to start Rime LoRaWAN Gateway.
    

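    The output above shows that the service was restarted too quickly: it hit systemd's start rate limit, so systemd stopped restarting it and marked it as failed.

    The log from journalctl -u rime_gateway.service shows that systemd scheduled a restart 100 ms after each exit (the RestartSec value reported in the log) and that all 5 restarts failed.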

    -- Logs begin at Mon 2020-04-20 06:51:31 UTC, end at Mon 2020-04-20 06:55:01 UTC. --
    Apr 20 06:51:40 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
    Apr 20 06:51:40 orangepizero start_gateway.sh[572]: Reset start_gateway.sh
    Apr 20 06:51:41 orangepizero start_gateway.sh[572]: Starting start_gateway.sh
    Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
    Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
    Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
    Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 1.
    
    ...
    
    Apr 20 06:51:45 orangepizero start_gateway.sh[1112]: Reset start_gateway.sh
    Apr 20 06:51:46 orangepizero start_gateway.sh[1112]: Starting start_gateway.sh
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 5.
    Apr 20 06:51:46 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Start request repeated too quickly.
    Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
    Apr 20 06:51:46 orangepizero systemd[1]: Failed to start Rime LoRaWAN Gateway.
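
    The "Start request repeated too quickly" message comes from systemd's start rate limit. Unless the defaults have been changed in system.conf, a unit may be started at most 5 times (StartLimitBurst) within 10 seconds (StartLimitIntervalSec); the next start inside that window is refused and the unit is left in the failed state. The shell function below is only a simplified model of that rule, not systemd's actual implementation. The function name simulate_start_limit and the timestamps in the usage note are made up for illustration.

```shell
#!/bin/sh
# simulate_start_limit INTERVAL BURST T1 T2 ...
# Simplified model of a start rate limit: at most BURST starts are allowed
# per window of INTERVAL seconds. Prints "allow" or "deny" for each start
# attempt at time Tn (integer seconds, ascending order).
simulate_start_limit() {
    interval=$1
    burst=$2
    shift 2
    begin=
    count=0
    for now in "$@"; do
        # An interval (or burst) of 0 disables rate limiting entirely.
        if [ "$interval" -le 0 ] || [ "$burst" -le 0 ]; then
            echo allow
            continue
        fi
        # Open a new window on the first attempt or once the old one expired.
        if [ -z "$begin" ] || [ "$now" -gt $((begin + interval)) ]; then
            begin=$now
            count=0
        fi
        if [ "$count" -lt "$burst" ]; then
            count=$((count + 1))
            echo allow
        else
            echo deny
        fi
    done
}
```

    For example, simulate_start_limit 10 5 0 1 2 3 4 5 prints allow five times and then deny. This mirrors the log above, where the initial start and four restarts run, and the fifth restart (the sixth start) is refused. With an interval of 0 every attempt is allowed.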
    

    The gateway's own log (tail -f /tmp/start_gateway.sh.log) shows that the failure was caused by the network not being up yet:

    ERROR: [up] connect returned Network is unreachable
    

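    Adjusting the startup order

    Clearly the service depends on the network being up, so we first added the following line to the [Unit] section: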

    After=network.target
    

    Did this ordering take effect? To check, we exported the boot sequence as a chart:

    systemd-analyze plot > boot.svg
    

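    Opening boot.svg in Chrome confirms that network.target is reached first and rime_gateway.service starts afterwards.

    For more on startup ordering, see "Linux systemd启动守护进程,service启动顺序分析及调整service启动顺序" (an article on analyzing and adjusting service startup order under systemd).

    Restarting on failure

    To make the service more robust, it should be restarted automatically whenever it exits with a failure. We added the following settings for that.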
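
    One caveat about After=network.target: according to the systemd documentation, network.target only indicates that the network management stack has been started, not that an address or route is configured, so a service ordered after it can still see "Network is unreachable" at boot. The target intended for "the network is up" is network-online.target, pulled in with Wants= and ordered with After=. The sketch below writes such a drop-in. The function name, the drop-in file name 10-network-online.conf, and the directory argument are assumptions for illustration, and network-online.target only helps if a wait-online service (for example NetworkManager-wait-online.service or systemd-networkd-wait-online.service) is enabled on the image.

```shell
#!/bin/sh
# Write a drop-in that orders rime_gateway.service after network-online.target.
# The unit directory is a parameter so the function can be tried in a scratch
# directory first; the file name 10-network-online.conf is an arbitrary choice.
write_network_online_dropin() {
    unit_dir=$1
    dropin_dir="$unit_dir/rime_gateway.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-network-online.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}

# Usage on the gateway (as root, not run here):
#   write_network_online_dropin /etc/systemd/system
#   systemctl daemon-reload
```

    Even with this ordering in place, automatic restart is still worth having, since the network can also drop after boot.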

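    Setting the rate-limit interval to 0 in the [Unit] section disables the start rate limit, so systemd keeps trying to restart the service indefinitely: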

    StartLimitIntervalSec=0
    

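    Waiting 1 second between restarts (set in the [Service] section) is a good idea, so that a persistent fault does not put too much load on the system: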

    RestartSec=1
    

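    For more on automatic restarts, see "使用systemd创建Linux服务" (an article on creating Linux services with systemd).

    A stable service

    The final rime_gateway.service is shown below: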

    [Unit]
    Description=Rime LoRaWAN Gateway
    After=network.target
    StartLimitIntervalSec=0
    
    [Service]
    WorkingDirectory=/home/rime/packet_forwarder/lora_pkt_fwd
    ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh
    Restart=always
    RestartSec=1
    
    [Install]
    WantedBy=multi-user.target
    

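    systemctl status rime_gateway.service and journalctl -u rime_gateway.service show that the service now starts normally.

    To test the failure case, we unplugged the network cable and then rebooted Armbian. systemd restarted the service 1 second after each exit until the network came back (78 restarts in this run).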

    -- Logs begin at Mon 2020-04-20 07:32:09 UTC, end at Mon 2020-04-20 07:35:12 UTC. --
    Apr 20 07:32:19 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
    Apr 20 07:32:20 orangepizero start_gateway.sh[839]: Reset start_gateway.sh
    Apr 20 07:32:20 orangepizero start_gateway.sh[839]: Starting start_gateway.sh
    Apr 20 07:32:20 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
    Apr 20 07:32:20 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
    Apr 20 07:32:21 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=1s expired, scheduling restart.
    Apr 20 07:32:21 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 1.
    Apr 20 07:32:21 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
    Apr 20 07:32:21 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
    Apr 20 07:32:22 orangepizero start_gateway.sh[991]: Reset start_gateway.sh
    Apr 20 07:32:22 orangepizero start_gateway.sh[991]: Starting start_gateway.sh
    
    ...
    
    Apr 20 07:34:54 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
    Apr 20 07:34:54 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
    Apr 20 07:34:55 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=1s expired, scheduling restart.
    Apr 20 07:34:55 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 78.
    Apr 20 07:34:55 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
    Apr 20 07:34:55 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
    Apr 20 07:34:55 orangepizero start_gateway.sh[2644]: Reset start_gateway.sh
    Apr 20 07:34:56 orangepizero start_gateway.sh[2644]: Starting start_gateway.sh
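
    To read the restart count without scrolling through the journal, the counter can be pulled out of the "restart counter is at N" lines. The helper below is a small sketch; the function name last_restart_counter is hypothetical, and it assumes the message format shown in the logs above.

```shell
#!/bin/sh
# Print the most recent restart counter found in journal text on stdin.
# Prints nothing if no "restart counter is at N" line is present.
last_restart_counter() {
    sed -n 's/.*restart counter is at \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Usage on the gateway (not run here):
#   journalctl -u rime_gateway.service | last_restart_counter
```

    On systemd versions that expose the NRestarts property, systemctl show rime_gateway.service -p NRestarts should report the same counter directly.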
    
  • Original article: https://www.cnblogs.com/rimelink/p/12738201.html