A customer asked for it, so it had to be tried. The sad part: all I had on hand were offline installation notes for 3.7, and the earlier 3.11 install had gone so smoothly (thanks to a colleague's excellent documentation) that I started without studying anything carefully.
Sure enough, the /etc/ansible/hosts file was written wrong and I ran into a pile of problems. Recording them here.
1. Problems encountered
- Images not ready
All the images had been pulled, but I had not read the document carefully and only ran docker save -o on the handful it listed, which produced the error below and forced a fresh download.
One or more required container images are not available:
    openshift3/registry-console:v3.6,
    registry.example.com/openshift3/ose-deployer:v3.6.173.0.130,
    registry.example.com/openshift3/ose-docker-registry:v3.6.173.0.130,
    registry.example.com/openshift3/ose-haproxy-router:v3.6.173.0.130,
    registry.example.com/openshift3/ose-pod:v3.6.173.0.130
Checked with: skopeo inspect [--tls-verify=false] [--creds=<user>:<pass>] docker://<registry>/<image>
Default registries searched: registry.example.com, registry.access.redhat.com
Failed connecting to: registry.example.com, registry.access.redhat.com
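Before kicking off another hour-long run, the same check the installer uses can be run by hand. A minimal sketch, assuming the offline registry is registry.example.com with TLS verification disabled; the image list is taken from the error above:

for img in ose-deployer ose-docker-registry ose-haproxy-router ose-pod; do
    # skopeo inspect succeeds only if the registry can actually serve the image
    skopeo inspect --tls-verify=false \
        docker://registry.example.com/openshift3/${img}:v3.6.173.0.130 > /dev/null \
        && echo "${img}: OK" || echo "${img}: MISSING"
done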
- Registry port 443 not configured
Following the 3.11 install I had configured the registry on port 80 and assumed that would pass muster; instead it failed:
[root@master ~]# oc logs registry-console-1-deploy -n default
--> Scaling registry-console-1 to 1
--> Waiting up to 10m0s for pods in rc registry-console-1 to become ready
E1114 13:34:58.912499 1 reflector.go:304] github.com/openshift/origin/pkg/deploy/strategy/support/lifecycle.go:509: Failed to watch *api.Pod: Get https://172.30.0.1:443/api/v1/namespaces/default/pods?labelSelector=deployment%3Dregistry-console-1%2Cdeploymentconfig%3Dregistry-console%2Cname%3Dregistry-console&resourceVersion=1981&timeoutSeconds=412&watch=true: dial tcp 172.30.0.1:443: getsockopt: connection refused
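For reference, a sketch of what moving the offline registry to 443 can look like, assuming the RHEL docker-distribution package (used in step 5 below) with its config at /etc/docker-distribution/registry/config.yml; add a tls: section with certificate/key under http: if the registry should speak TLS instead of being listed as insecure:

cat <<EOF > /etc/docker-distribution/registry/config.yml
version: 0.1
log:
  fields:
    service: registry
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :443
EOF
systemctl restart docker-distribution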
- registry-console / service-catalog images need a retag
Pulling the service-catalog image failed. This one is a big pit: every install attempt runs for more than an hour before it surfaces. The error looks like this:
15m  13m  4   kubelet, master.example.com  spec.containers{apiserver}  Normal   Pulling     pulling image "registry.access.redhat.com/openshift3/ose-service-catalog:v3.6"
15m  13m  4   kubelet, master.example.com  spec.containers{apiserver}  Warning  Failed      Failed to pull image "registry.access.redhat.com/openshift3/ose-service-catalog:v3.6": rpc error: code = 2 desc = All endpoints blocked.
15m  13m  6   kubelet, master.example.com  spec.containers{apiserver}  Normal   BackOff     Back-off pulling image "registry.access.redhat.com/openshift3/ose-service-catalog:v3.6"
15m  4m   46  kubelet, master.example.com                              Warning  FailedSync  Error syncing pod
The fix is as follows:
docker pull registry.example.com/openshift3/registry-console:v3.6.173.0.130
docker tag registry.example.com/openshift3/registry-console:v3.6.173.0.130 registry.example.com/openshift3/registry-console:v3.6
docker push registry.example.com/openshift3/registry-console:v3.6
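The service-catalog pull above fails for the same reason: the pod asks for the short :v3.6 tag, which the offline registry does not carry. The same retag works for it too; a sketch, with the image list being an assumption (whatever your install complains about):

for img in registry-console ose-service-catalog; do
    docker pull registry.example.com/openshift3/${img}:v3.6.173.0.130
    # publish the short major.minor tag that the pods actually request
    docker tag registry.example.com/openshift3/${img}:v3.6.173.0.130 \
               registry.example.com/openshift3/${img}:v3.6
    docker push registry.example.com/openshift3/${img}:v3.6
done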
- yum configured, but docker not found
Installing docker on the master failed with the package not found, even though every host was configured with the same yum repositories. In the end I worked around it by registering the host online with subscription-manager.
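Before falling back to an online subscription-manager registration, a few standard yum checks help pinpoint whether any configured repo actually carries the package (on RHEL 7, docker normally comes from the extras channel, i.e. the [extra] repo in the ocp.repo file in section 4):

yum clean all
yum repolist enabled
yum list available docker --showduplicates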
- The apiserver pod starts, but cannot be reached; the error:
curl: (6) Could not resolve host: apiserver.kube-service-catalog.svc; Unknown error
Fixed by changing /etc/resolv.conf to:
[root@node2 ~]# cat /etc/resolv.conf
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Generated by NetworkManager
search cluster.local example.com
nameserver 192.168.0.105
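A quick way to confirm the fix took effect, assuming bind-utils is installed (it is in the package list in section 5) and 192.168.0.105 is the node-local resolver from the resolv.conf above:

# the short service name should now resolve via the cluster.local search domain
dig +short apiserver.kube-service-catalog.svc.cluster.local @192.168.0.105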
Unlike 3.11, 3.6 has no prerequisites check play: the install simply starts, and you have to wait to see whether it errors out, so every attempt takes a long time.
The hosts file options are documented here; required reading before stepping into these pits:
https://docs.okd.io/3.6/install_config/install/advanced_install.html#enabling-service-catalog
- Metrics and other components missing after the install finished
The log at the end of the run:
TASK [openshift_excluder : Enable openshift excluder] *******************************************************************************************************************
changed: [node1.example.com]
changed: [master.example.com]
changed: [node2.example.com]

PLAY RECAP **************************************************************************************************************************************************************
localhost          : ok=15   changed=0   unreachable=0  failed=0
master.example.com : ok=740  changed=72  unreachable=0  failed=0
nfs.example.com    : ok=91   changed=3   unreachable=0  failed=0
node1.example.com  : ok=250  changed=18  unreachable=0  failed=0
node2.example.com  : ok=250  changed=18  unreachable=0  failed=0
A check showed only the following pods; none of the metrics components I had configured came up, so the hosts file had to be at fault.
[root@master ~]# oc get pods --all-namespaces
NAMESPACE              NAME                       READY  STATUS             RESTARTS  AGE
default                docker-registry-1-x0hlq    1/1    Running            7         2d
default                registry-console-2-p84p6   1/1    Running            2         1d
default                router-10-ttqq9            0/1    MatchNodeSelector  0         1d
default                router-12-rfpxc            1/1    Running            1         1d
kube-service-catalog   apiserver-3ls5x            1/1    Running            1         1d
kube-service-catalog   controller-manager-7zdbc   0/1    CrashLoopBackOff   1         1d
[root@master ~]# oc get nodes
NAME                 STATUS  AGE  VERSION
master.example.com   Ready   2d   v1.6.1+5115d708d7
node1.example.com    Ready   2d   v1.6.1+5115d708d7
node2.example.com    Ready   2d   v1.6.1+5115d708d7
- Uninstall playbook
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/adhoc/uninstall.yml;
- DNS failing to start makes atomic-openshift-node.service fail to start
Nov 17 18:55:51 master.example.com atomic-openshift-node[32772]: I1117 18:55:51.787479 32772 mount_linux.go:203] Detected OS with systemd
Nov 17 18:55:51 master.example.com atomic-openshift-node[32772]: I1117 18:55:51.787497 32772 docker.go:364] Connecting to docker on unix:///var/run/docker.sock
Nov 17 18:55:51 master.example.com atomic-openshift-node[32772]: I1117 18:55:51.787510 32772 docker.go:384] Start docker client with request timeout=2m0s
Nov 17 18:55:51 master.example.com atomic-openshift-node[32772]: W1117 18:55:51.789279 32772 cni.go:157] Unable to update cni config: No networks found in /etc/cni/net.d
Nov 17 18:55:51 master.example.com atomic-openshift-node[32772]: F1117 18:55:51.798668 32772 start_node.go:140] could not start DNS, unable to read config file: open /etc/origin/node/resolv.conf: no such file or directory
Nov 17 18:55:51 master.example.com systemd[1]: atomic-openshift-node.service: main process exited, code=exited, status=255/n/a
Nov 17 18:55:51 master.example.com systemd[1]: Failed to start OpenShift Node.
-- Subject: Unit atomic-openshift-node.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit atomic-openshift-node.service has failed.
The fix: copy a resolv.conf into place.
[root@master ansible]# cd /etc/origin/node
[root@master node]# ls
ca.crt            node-dnsmasq.conf  server.key                          system:node:master.example.com.key
node-config.yaml  server.crt         system:node:master.example.com.crt  system:node:master.example.com.kubeconfig
[root@master node]# cp /etc/resolv.conf .
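After copying the file, restart the node service and watch the journal to confirm the DNS error is gone:

systemctl restart atomic-openshift-node
journalctl -u atomic-openshift-node -f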
- Router failed to start: analysis showed the deployment to node2.example.com failed because the pod could not bind port 443
[root@master node]# oc get pods -o wide
NAME             READY  STATUS            RESTARTS  AGE  IP             NODE
router-1-deploy  0/1    Error             0         30m  10.129.0.14    node2.example.com
router-2-55bpf   1/1    Running           0         5m   192.168.0.104  node1.example.com
router-2-deploy  1/1    Running           0         5m   10.128.0.14    node1.example.com
router-2-dw31q   1/1    Running           0         5m   192.168.0.103  master.example.com
router-2-xn9cp   0/1    CrashLoopBackOff  6         5m   192.168.0.105  node2.example.com
[root@master node]# oc logs router-2-xn9cp
I1117 12:19:27.665452 1 template.go:246] Starting template router (v3.6.173.0.130)
I1117 12:19:27.679413 1 metrics.go:43] Router health and metrics port listening at 0.0.0.0:1936
I1117 12:19:27.700732 1 router.go:240] Router is including routes in all namespaces
E1117 12:19:27.777551 1 ratelimiter.go:52] error reloading router: exit status 1
[ALERT] 320/121927 (45) : Starting frontend public_ssl: cannot bind socket [0.0.0.0:443]
Analysis: the registry on node2 also binds port 443, which presumably conflicts, so I edited the Ansible inventory and removed the router label from node2.
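Besides fixing the inventory for the next run, the same change can be made on a live cluster by removing the label that the router's node selector matches; a sketch based on the router=true label used in the hosts file below:

# drop the label so the router selector no longer schedules a pod on node2
oc label node node2.example.com router-
# match the replica count to the two remaining router nodes
oc scale dc/router --replicas=2 -n default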
To get monitoring deployed I reworked the hosts file once more. The hosts file from the final, successful install is below for reference:
# Create an OSEv3 group that contains the masters and nodes groups
[OSEv3:children]
masters
nodes
etcd
nfs

[OSEv3:vars]
ansible_ssh_user=root
openshift_deployment_type=openshift-enterprise
osm_cluster_network_cidr=10.128.0.0/14
openshift_portal_net=172.30.0.0/16
openshift_master_api_port=8443
openshift_master_console_port=8443

openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_nfs_directory=/exports
openshift_hosted_registry_storage_nfs_options='*(rw,root_squash)'
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=10Gi

oreg_url=registry.example.com/openshift3/ose-${component}:${version}
openshift_docker_additional_registries=registry.example.com
openshift_docker_insecure_registries=registry.example.com
openshift_docker_blocked_registries=registry.access.redhat.com,docker.io
openshift_image_tag=v3.6.173.0.130

openshift_enable_service_catalog=true
openshift_service_catalog_image_prefix=registry.example.com/openshift3/ose-
openshift_service_catalog_image_version=v3.6.173.0.130
ansible_service_broker_image_prefix=registry.example.com/openshift3/ose-
ansible_service_broker_etcd_image_prefix=registry.example.com/rhel7/
template_service_broker_prefix=registry.example.com/openshift3/
oreg_url=registry.example.com/openshift3/ose-${component}:${version}
openshift_examples_modify_imagestreams=true
openshift_clock_enabled=true

openshift_metrics_storage_kind=nfs
openshift_metrics_install_metrics=true
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_host=nfs.example.com
openshift_metrics_storage_nfs_directory=/exports
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=10Gi
openshift_metrics_hawkular_hostname=hawkular-metrics.apps.example.com
#openshift_metrics_cassandra_storage_type=emptydir
openshift_metrics_image_prefix=registry.example.com/openshift3/
openshift_hosted_metrics_deploy=true
openshift_hosted_metrics_public_url=https://hawkular-metrics.apps.example.com/hawkular/metrics
openshift_metrics_image_version=v3.6.173.0.130

openshift_template_service_broker_namespaces=['openshift']
template_service_broker_selector={"node": "true"}

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]
# Default login account: admin / handhand
openshift_master_htpasswd_users={'admin': '$apr1$gfaL16Jf$c.5LAvg3xNDVQTkk6HpGB1'}

#openshift_repos_enable_testing=true
openshift_disable_check=docker_image_availability,disk_availability,memory_availability,docker_storage
docker_selinux_enabled=false
openshift_docker_options=" --selinux-enabled --insecure-registry 172.30.0.0/16 --log-driver json-file --log-opt max-size=50M --log-opt max-file=3 --insecure-registry registry.example.com --add-registry registry.example.com"
osm_etcd_image=rhel7/etcd
openshift_logging_image_prefix=registry.example.com/openshift3/
openshift_hosted_router_selector='region=infra,router=true'
openshift_master_default_subdomain=app.example.com

# host group for masters
[masters]
master.example.com

# host group for etcd
[etcd]
master.example.com

# host group for nodes, includes region info
[nodes]
master.example.com openshift_node_labels="{'region': 'infra', 'router': 'true', 'zone': 'default'}" openshift_schedulable=true
node1.example.com openshift_node_labels="{'region': 'infra', 'router': 'true', 'zone': 'default'}" openshift_schedulable=true
node2.example.com openshift_node_labels="{'region': 'infra', 'zone': 'default'}" openshift_schedulable=true

[nfs]
nfs.example.com
I reinstalled from scratch with this final hosts file, and this time everything finally came up:
[root@master ~]# oc get pods --all-namespaces
NAMESPACE              NAME                         READY  STATUS   RESTARTS  AGE
default                docker-registry-1-p8p0s      1/1    Running  2         2h
default                registry-console-1-t4bw2     1/1    Running  0         1h
default                router-1-1nnt3               1/1    Running  2         2h
default                router-1-4h8tg               1/1    Running  3         2h
kube-service-catalog   apiserver-z6nmz              1/1    Running  2         1h
kube-service-catalog   controller-manager-d2jgc     1/1    Running  0         1h
openshift-infra        hawkular-cassandra-1-m6r4x   1/1    Running  0         1h
openshift-infra        hawkular-metrics-4j828       1/1    Running  1         1h
openshift-infra        heapster-rgwrw               1/1    Running  6         2h
Check the PV and PVC:
[root@master ~]# oc get pv,pvc
NAME                 CAPACITY  ACCESSMODES  RECLAIMPOLICY  STATUS  CLAIM                    STORAGECLASS  REASON  AGE
pv/registry-volume   10Gi      RWX          Retain         Bound   default/registry-claim                         26m

NAME                 STATUS  VOLUME           CAPACITY  ACCESSMODES  STORAGECLASS  AGE
pvc/registry-claim   Bound   registry-volume  10Gi      RWX                        26m
2. Batch image-save script
# save every local image as a gzipped tarball under /root/images
mkdir -p /root/images
for i in $(docker images --format '{{.Repository}}:{{.Tag}}'); do
    # file name: the last path component of the repository, without the tag
    imagename=$(echo "$i" | awk -F '/' '{print $NF}' | awk -F ':' '{print $1}')
    echo "$imagename"
    docker save "$i" | gzip -c > /root/images/"$imagename".tar.gz
done
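The matching restore loop for the disconnected side, assuming the tarballs were copied to /root/images on the target host:

# load every saved image back into the local docker daemon
for f in /root/images/*.tar.gz; do
    echo "loading $f"
    gunzip -c "$f" | docker load
done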
3. Putting images on a separate disk
Add a new disk in VirtualBox, then use
fdisk -l
to find the corresponding device, e.g. /dev/sdb.
Partition it:
echo "n p 1 w" | fdisk /dev/sdb;
Create the PV and volume group:
pvcreate /dev/sdb1;
vgcreate docker-vg /dev/sdb1;
Point docker at docker-vg:
vgs
cat <<EOF > /etc/sysconfig/docker-storage-setup
VG=docker-vg
EOF
docker-storage-setup
lvextend -l 100%VG /dev/docker-vg/docker-pool
touch /etc/containers/registries.conf
systemctl start docker
systemctl enable docker
lvs
getenforce
4. The ocp.repo file
[root@master ~]# cat /etc/yum.repos.d/ocp.repo
[server]
name=server
baseurl=http://192.168.56.103:8080/repo/rhel-7-server-rpms/
enabled=1
gpgcheck=0

[datapath]
name=datapath
baseurl=http://192.168.56.103:8080/repo/rhel-7-fast-datapath-rpms/
enabled=1
gpgcheck=0

[extra]
name=extra
baseurl=http://192.168.56.103:8080/repo/rhel-7-server-extras-rpms/
enabled=1
gpgcheck=0

[ose]
name=ose
baseurl=http://192.168.56.103:8080/repo/rhel-7-server-ose-3.6-rpms/
enabled=1
gpgcheck=0
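The baseurl entries imply an HTTP server on the repo host (192.168.56.103) exporting the synced RPM trees on port 8080. A minimal sketch of one way to do that, assuming the trees live under /repo and createrepo has already been run in each of them:

cd /repo
# RHEL 7 ships python2; this serves the current directory tree on 0.0.0.0:8080
python -m SimpleHTTPServer 8080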
5. Main installation steps
systemctl stop firewalld
systemctl disable firewalld
systemctl mask firewalld
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config

yum clean all
yum repolist
yum install -y docker
yum -y install wget git net-tools bind-utils iptables-services bridge-utils bash-completion vim atomic-openshift-excluder atomic-openshift-docker-excluder lrzsz unzip atomic-openshift-utils
yum -y install python-setuptools
yum -y update

ssh-keygen
ssh-copy-id root@master.example.com
ssh-copy-id root@node1.example.com
ssh-copy-id root@node2.example.com

# answers: new partition, primary, number 1, default first/last sector, write
printf 'n\np\n1\n\n\nw\n' | fdisk /dev/sdb
pvcreate /dev/sdb1
vgcreate docker-vg /dev/sdb1
vgs
cat <<EOF > /etc/sysconfig/docker-storage-setup
VG=docker-vg
EOF
docker-storage-setup
lvextend -l 100%VG /dev/docker-vg/docker-pool
touch /etc/containers/registries.conf
systemctl start docker
systemctl enable docker
lvs
getenforce

yum -y install docker-distribution
systemctl enable docker-distribution
systemctl start docker-distribution
The service catalog tiles show up grayed out in the console; one glance tells you it is a Technology Preview release.