  • Deploying PBS-Torque in a Virtualized Cluster

    1. Overview

    This post describes how to deploy, configure, and use the PBS scheduling system on a CentOS 7 cluster.

    CentOS 7 version: CentOS Linux release 7.9.2009 (Core)

    PBS scheduler version: torque-6.1.3.tar.gz

    Cluster information:

    Three virtual nodes are used to deploy the PBS system.

    Node name  Node IP        Role                          Services
    node16     192.168.80.16  management node, login node   pbs_server pbs_sched
    node17     192.168.80.17  compute node                  pbs_mom
    node18     192.168.80.18  compute node                  pbs_mom

    This post covers only the basic deployment, configuration, and use of the Torque PBS scheduler. More advanced topics, such as MPI environments, the graphical interface, GPU scheduling, Munge authentication, reading information from the database, and high-availability configuration, are not explored in detail here and may be filled in later.

    2. Deployment

    A cluster generally needs synchronized time and unified identity authentication; those two steps are not covered in this post.

    The nodes node16-18 in this post already share global identity authentication through LDAP and SSSD.

    By convention, a software installation directory on node16 is used as the shared directory and exported to node17 and node18.

    2.1 Creating and mounting the shared directory

    On node16, run mkdir -p /hpc/torque/6.1.3; this directory holds the Torque installation and is shared with the other nodes.

    Run mkdir -p /hpc/packages/ to create the working directory for building Torque.

    Edit the /etc/exports file with the following content:

    /hpc 192.168.80.0/24(rw,no_root_squash,no_all_squash)
    /home 192.168.80.0/24(rw,no_root_squash,no_all_squash)
    

    Run systemctl start nfs && systemctl enable nfs to start NFS and enable it at boot.

    Run exportfs -r to make the exports take effect immediately.

    On node17 and node18, run:

    mkdir -p /hpc
    mount.nfs 192.168.80.16:/hpc /hpc
    mount.nfs 192.168.80.16:/home /home
    
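    To make these mounts persist across reboots, they can also be recorded in /etc/fstab on node17 and node18. A minimal sketch, assuming the default NFS mount options are acceptable in your environment:

```shell
# /etc/fstab entries on node17/node18 (sketch; _netdev waits for the network before mounting)
192.168.80.16:/hpc   /hpc   nfs  defaults,_netdev  0 0
192.168.80.16:/home  /home  nfs  defaults,_netdev  0 0
```

    After editing, mount -a applies the entries without a reboot.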

    2.2 Deploying the Torque software

    Download

    wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz
    

    Extract

    tar -zxvf torque-6.1.3.tar.gz -C /hpc/packages extracts the archive into the build working directory.

    Configure the build

    # 1. Install the build dependencies with yum
    yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
    # yum groupinstall "GNOME Desktop" "Graphical Administrator Tools"  # only needed when configuring with --enable-gui
    # 2. Pass options to configure to set up the build
    ./configure \
    	--prefix=/hpc/torque/6.1.3 \
    	--mandir=/hpc/torque/6.1.3/man \
    	--enable-cgroups \
    	--enable-syslog \
    	--enable-drmaa \
    	--enable-gui \
    	--with-xauth \
    	--with-hwloc \
    	--with-pam \
    	--with-tcl \
    	--with-tk
    	# --enable-numa-support  # with this option, mom.layout must be edited; the rules are unclear, so it is left out for now
    # 3. To be updated: support for MPI, GPU, Munge authentication, high availability, etc. may be added later
    

    After configure finishes, it prints a summary (note: Default server shows node13 in this captured output; on the cluster described in this post it would be node16):

    Building components: server=yes mom=yes clients=yes
                         gui=yes drmaa=yes pam=yes
    PBS Machine type    : linux
    Remote copy         : /usr/bin/scp -rpB
    PBS home            : /var/spool/torque
    Default server      : node13
    
    Unix Domain sockets : 
    Linux cpusets       : no
    Tcl                 : -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
    Tk                  : -L/usr/lib64 -ltk8.5 -lX11 -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
    Authentication      : trqauthd
    
    Ready for 'make'.
    

    Build and install

    # 1. Build and install
    make -j4 && make install
    # 2. Generate self-extracting installer packages (optional)
    # make packages
    

    Output of running make packages:

    This step can be skipped: in this post the working directory is on a shared filesystem, so it is enough to run make install on each node.

    [root@node16 torque-6.1.3]# make packages
    Building packages from /hpc/packages/torque-6.1.3/tpackages
    rm -rf /hpc/packages/torque-6.1.3/tpackages
    mkdir /hpc/packages/torque-6.1.3/tpackages
    Building ./torque-package-server-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-mom-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-clients-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-gui-linux-x86_64.sh ...
    Building ./torque-package-pam-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /lib64/security'
    Building ./torque-package-drmaa-linux-x86_64.sh ...
    libtool: install: warning: relinking `libdrmaa.la'
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-devel-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-doc-linux-x86_64.sh ...
    Done.
    
    The package files are self-extracting packages that can be copied
    and executed on your production machines.  Use --help for options.
    

    Running libtool --finish /hpc/torque/6.1.3/lib

    This step can also be skipped; make install performs it by default.

    [root@node16 torque-6.1.3]# libtool --finish /hpc/torque/6.1.3/lib
    libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /hpc/torque/6.1.3/lib
    ----------------------------------------------------------------------
    Libraries have been installed in:
       /hpc/torque/6.1.3/lib
    
    If you ever happen to want to link against installed libraries
    in a given directory, LIBDIR, you must either use libtool, and
    specify the full pathname of the library, or use the `-LLIBDIR'
    flag during linking and do at least one of the following:
       - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
         during execution
       - add LIBDIR to the `LD_RUN_PATH' environment variable
         during linking
       - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
       - have your system administrator add LIBDIR to `/etc/ld.so.conf'
    
    See any operating system documentation about shared libraries for
    more information, such as the ld(1) and ld.so(8) manual pages.
    ----------------------------------------------------------------------
    

    Next come the environment variables and the service unit files.

    At this point, ls -lrt /usr/lib/systemd/system shows that the following unit files already exist:

    -rw-r--r--  1 root root 1284 10月 12 11:17 pbs_server.service
    -rw-r--r--  1 root root  704 10月 12 11:17 pbs_mom.service
    -rw-r--r--  1 root root  335 10月 12 11:17 trqauthd.service
    

    The pbs_sched.service unit file is missing; copy it from the /hpc/packages/torque-6.1.3/contrib/systemd directory into the system:

    cp /hpc/packages/torque-6.1.3/contrib/systemd/pbs_sched.service /usr/lib/systemd/system

    At this point, ls -lrt /etc/profile.d shows that torque.sh is already there; just run source /etc/profile to load it.
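    The generated torque.sh should look roughly like the following; the exact contents depend on the --prefix given to configure, so treat this as an assumption and check the file itself:

```shell
# Approximate contents of /etc/profile.d/torque.sh (verify with: cat /etc/profile.d/torque.sh)
export PATH=/hpc/torque/6.1.3/bin:/hpc/torque/6.1.3/sbin:$PATH
export MANPATH=/hpc/torque/6.1.3/man:$MANPATH
```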

    3. Configuration

    3.1 Configuring the management node

    3.1.1 Adding the PBS administrative user

    Here it is set to the root user.

    Run ./torque.setup; the script's own comment describes it as: create pbs_server database and default queue

    [root@node16 torque-6.1.3]# ./torque.setup root
    hostname: node13
    Currently no servers active. Default server will be listed as active server. Error  15133
    Active server name: node13  pbs_server port is: 15001
    trqauthd daemonized - port /tmp/trqauthd-unix
    trqauthd successfully started
    initializing TORQUE (admin: root)
    
    You have selected to start pbs_server in create mode.
    If the server database exists it will be overwritten.
    do you wish to continue y/(n)?
    # enter y
    

    3.1.2 Starting the authentication service

    The step in 3.1.1 already starts the trqauthd authentication service, which can be verified with ps axu|grep trqauthd.

    A subsequent systemctl start trqauthd will then fail, so it is recommended to kill that process first with pkill -f trqauthd and then start the service through systemctl:

    pkill -f trqauthd
    systemctl start trqauthd
    systemctl enable trqauthd
    

    3.1.3 Starting the server

    Configure the compute nodes: vim /var/spool/torque/server_priv/nodes

    node17 np=4
    node18 np=4
    

    Then run the following commands:

    systemctl status pbs_server 
    systemctl start pbs_server  # if this step fails, check whether pbs_server is already running; if so, run pkill -f pbs_server and then retry
    systemctl enable pbs_server
    

    Run qnodes to view the node information.

    If the qnodes command is not found, run source /etc/profile to load the environment variables.

    node17
         state = down
         power_state = Running
         np = 4
         ntype = cluster
         mom_service_port = 15002
         mom_manager_port = 15003
         total_sockets = 0
         total_numa_nodes = 0
         total_cores = 0
         total_threads = 0
         dedicated_sockets = 0
         dedicated_numa_nodes = 0
         dedicated_cores = 0
         dedicated_threads = 0
    
    node18
         state = down
         power_state = Running
         np = 4
         ntype = cluster
         mom_service_port = 15002
         mom_manager_port = 15003
         total_sockets = 0
         total_numa_nodes = 0
         total_cores = 0
         total_threads = 0
         dedicated_sockets = 0
         dedicated_numa_nodes = 0
         dedicated_cores = 0
         dedicated_threads = 0
    # node17 and node18 show state = down because pbs_mom has not been started on them
    

    3.1.4 Starting the scheduler

    On node16 you must also run systemctl start pbs_sched; otherwise all submitted jobs stay in the Q state.

    Enable it at boot with systemctl enable pbs_sched.

    3.2 Configuring the compute nodes

    Section 3.1 completed the deployment of the management node node16, including:

    • Installing the dependencies with yum
    • Extracting the source, configuring the build, compiling, and installing
    • Configuring the administrative user
    • Editing the configuration files
    • Starting the trqauthd, pbs_server, and pbs_sched services

    The compute nodes need the following:

    • Installing the dependencies with yum
    • Configuring the management node information
    • Running the install scripts, or make install
    • Starting the trqauthd and pbs_mom services

    Because all of the work was done in the shared directory, the installation itself only requires running make install on node17 and node18.
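    The compute-node checklist above can be sketched as the following command sequence; the server_name and mom_priv/config paths are standard Torque locations under /var/spool/torque, but verify them against your installation:

```shell
# Run on node17 and node18 (with /hpc and /home already NFS-mounted)
yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel \
    pam-devel hwloc-devel numactl-devel tcl-devel tk-devel    # same dependencies as node16
cd /hpc/packages/torque-6.1.3
make install                                 # binaries were already built on node16
source /etc/profile                          # load the torque.sh environment
echo node16 > /var/spool/torque/server_name  # clients: which host runs pbs_server
echo '$pbsserver node16' > /var/spool/torque/mom_priv/config  # pbs_mom: where to report
systemctl start trqauthd pbs_mom
systemctl enable trqauthd pbs_mom
```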

    4. Usage

    4.1 Viewing and activating the queue

    Running torque.setup in 3.1.1 creates a default queue named batch and sets some of its basic attributes.

    At this point, run qmgr -c "active queue batch" so that jobs can be submitted to this queue.
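    The server and queue configuration that torque.setup created can be inspected with qmgr and qstat:

```shell
qmgr -c "print server"      # dump the full server and queue configuration as qmgr commands
qmgr -c "list queue batch"  # show the attributes of the batch queue
qstat -Q                    # one-line summary of every queue
```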

    Jobs are submitted on the management node; submitting on a compute node fails with an error:

    [liwl01@node18 ~]$ echo "sleep 120"|qsub
    qsub: submit error (Bad UID for job execution MSG=ruserok failed validating liwl01/liwl01 from node18)
    

    Submit a job on node16:

    [liwl01@node16 ~]$ echo "sleep 300"|qsub
    1.node16
    
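    Besides piping a command into qsub, jobs are usually submitted as a script; a minimal sketch (the job name, queue, and resource requests below are illustrative):

```shell
#!/bin/bash
#PBS -N sleep_test          # job name (hypothetical)
#PBS -q batch               # the queue created by torque.setup
#PBS -l nodes=1:ppn=1       # request 1 node, 1 core
#PBS -l walltime=00:10:00   # wall-clock limit
echo "job started on $(hostname)"
sleep 60
```

    Saved as job.sh and submitted on node16 with qsub job.sh, it then appears in qstat just like the STDIN job above.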

    While the job runs, the S column shows state R (running):

    [liwl01@node16 ~]$ qstat -a -n
    
    node16: 
                                                                                      Req'd       Req'd       Elap
    Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
    ----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
    1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 R  00:04:17
       node17/0
    

    After the job finishes, the S column shows state C (completed):

    [liwl01@node16 ~]$ qstat -a -n
    
    node16: 
                                                                                      Req'd       Req'd       Elap
    Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
    ----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
    1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 C       -- 
       node17/0
    
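    For scripting around qstat, the state column can be extracted with awk; a small sketch using the completed-job line from the output above:

```shell
# One data line from `qstat -a -n` (copied from the transcript above)
line='1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 C       --'
# With default whitespace splitting, the S (state) column is field 10
state=$(echo "$line" | awk '{print $10}')
echo "state=$state"
```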

    And the qnodes output:

    [liwl01@node16 ~]$ qnodes 
    node17
         state = free
         power_state = Running
         np = 4
         ntype = cluster
         jobs = 0/1.node16
         status = opsys=linux,uname=Linux node17 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,sessions=20038,nsessions=1,nusers=1,idletime=2159,totmem=3879980kb,availmem=3117784kb,physmem=3879980kb,ncpus=4,loadave=0.00,gres=,netload=1039583908,state=free,varattr= ,cpuclock=Fixed,macaddr=00:00:00:80:00:17,version=6.1.3,rectime=1634101823,jobs=1.node16
         mom_service_port = 15002
         mom_manager_port = 15003
         total_sockets = 1
         total_numa_nodes = 1
         total_cores = 4
         total_threads = 4
         dedicated_sockets = 0
         dedicated_numa_nodes = 0
         dedicated_cores = 0
         dedicated_threads = 1
    

    5. Maintenance

    To be updated later.

    [Questions and discussion are welcome! Email: yunweinote@126.com]
  • Original article: https://www.cnblogs.com/liwanliangblog/p/15401756.html