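When operating a cloud platform, you sometimes hit a failure caused by a single IP on one node becoming unreachable. A common case is the storage network of a Ceph OSD node going down, which takes all of that node's OSDs down. To check connectivity across all of the platform's networks quickly, I wrote a playbook based on Ansible's built-in facts and am recording it here.
1 Each host has three NICs
2 Script contents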
---
- hosts: all
  #vars_prompt:
  #  - name: share_user
  #    prompt: "input share_user"
  tasks:
    - block:
        - name: restart acpid service
          service: name=acpid state=restarted
        # From each host, ping the eth0/eth1/eth2 address of every host in the inventory.
        - name: get the network connection ip
          shell: |
            ping -c 2 "{{ hostvars[item[0]]['ansible_' + item[1]].ipv4.address }}"
          register: netinfo
          ignore_errors: yes
          with_nested:
            - "{{ groups['all'] }}"
            - ["eth0","eth1","eth2"]
        #- debug:
        #    var: netinfo
        # item.cmd holds the full ping command, including the target IP.
        - name: echo the no ping ip
          shell: echo "ip {{item.cmd}} is not ok" >>/root/noping.txt
          with_items:
            - "{{ netinfo.results }}"
          when: item.rc != 0
          delegate_to: localhost
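The playbook above assumes that every host reports an IPv4 address for eth0, eth1, and eth2. If an interface has no address in the gathered facts, the template lookup itself fails before ping ever runs. The following is only a sketch of one way to guard against that, not part of the original script: it skips host/interface pairs without an `ipv4` fact and prints the failures with `debug` instead of writing a file. The task names and the message text are made up for illustration.

```yaml
---
- hosts: all
  tasks:
    - name: ping each interface address that the facts actually contain
      shell: ping -c 2 "{{ hostvars[item[0]]['ansible_' + item[1]].ipv4.address }}"
      register: netinfo
      ignore_errors: yes
      with_nested:
        - "{{ groups['all'] }}"
        - ["eth0", "eth1", "eth2"]
      # Skip pairs whose interface fact is missing or has no ipv4 entry.
      when: "'ipv4' in (hostvars[item[0]]['ansible_' + item[1]] | default({}))"

    - name: list the targets that did not answer
      debug:
        msg: "{{ inventory_hostname }} cannot reach {{ item.item[0] }} via {{ item.item[1] }}"
      with_items:
        - "{{ netinfo.results }}"
      # Skipped results carry no rc, so check that it exists first.
      when: item.rc is defined and item.rc != 0
```

Here `item.item` in the second task is the original `[host, interface]` pair from the nested loop, which makes the report easier to read than the raw command string.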
3 Test
3.1 Shut down the eth2 NIC on minion1
3.2 Run the check script
3.3 Test results
Here is an additional playbook that uses the ansible_all_ipv4_addresses fact variable.
---
- hosts: all
  become: yes
  become_user: root
  become_method: sudo
  tasks:
    - block:
        - name: check the net connection(ping)
          shell: ping -c 2 {{ item }}
          register: netResult
          ignore_errors: yes
          with_items:
            - "{{ ansible_all_ipv4_addresses }}"
          when: item != '240.0.0.1'  # filter out IPs you do not want to check
        - debug:
            var: netResult
        - name: get the unreachable ip addresses
          shell: |
            echo "ip {{ item['item'] }} is unreachable" >>/root/noPing.txt
          with_items:
            "{{ netResult.results }}"
          when: item.item != '240.0.0.1' and item.rc != 0
          delegate_to: localhost
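Note that `ansible_all_ipv4_addresses` lists only the addresses of the host the task is running on, so in the playbook above each host pings its own IPs. If the goal is for every host to ping the addresses of all hosts, as in the first script, the per-host lists can be merged through `hostvars`. This is a minimal sketch under the assumption of Ansible 2.5 or later (for `loop` and the `flatten` filter) and of facts having been gathered for every host in `groups['all']`. The variable name `crossResult` and the message text are hypothetical.

```yaml
---
- hosts: all
  tasks:
    - name: ping every IPv4 address of every host in the inventory
      shell: ping -c 2 {{ item }}
      register: crossResult
      ignore_errors: yes
      # Collect each host's address list, merge them, and drop duplicates.
      loop: "{{ groups['all'] | map('extract', hostvars, 'ansible_all_ipv4_addresses') | flatten | unique }}"
      when: item != '240.0.0.1'  # same exclusion as in the playbook above

    - name: report addresses that did not answer
      debug:
        msg: "{{ inventory_hostname }} cannot reach {{ item.item }}"
      loop: "{{ crossResult.results }}"
      loop_control:
        label: "{{ item.item }}"
      # Skipped results carry no rc, so check that it exists first.
      when: item.rc is defined and item.rc != 0
```

Printing with `debug` keeps the sketch free of concurrent writes to one file on the control node. Swapping in the original `echo ... >>/root/noPing.txt` task with `delegate_to: localhost` would work the same way as before.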