  • Deploying PBS-Torque in a Virtualized Cluster

    1. Overview

    This post walks through deploying, configuring, and using the PBS scheduling system (Torque) on a CentOS 7 cluster.

    CentOS 7 version: CentOS Linux release 7.9.2009 (Core)

    PBS scheduler version: torque-6.1.3.tar.gz

    Cluster information:

    The PBS system is deployed on three virtual nodes.

    Node name   Node IP         Role                           Services
    node16      192.168.80.16   management node, login node    pbs_server, pbs_sched
    node17      192.168.80.17   compute node                   pbs_mom
    node18      192.168.80.18   compute node                   pbs_mom

    This post covers only the basic deployment, configuration, and use of the Torque PBS scheduler. More advanced topics, such as the MPI environment, GUI display, GPU scheduling, Munge authentication, reading information from the database, and high-availability configuration, are not explored in detail here and may be filled in later.

    2. Deployment

    A cluster generally needs synchronized time and cluster-wide identity authentication; neither step is covered in this post.

    Nodes node16-18 in this post already have cluster-wide authentication in place via LDAP and SSSD.

    In addition, a software installation directory on node16 is used as a shared directory, exported to node17 and node18.

    2.1 Create and mount the shared directory

    On node16, run mkdir -p /hpc/torque/6.1.3; this directory holds the Torque installation and is shared with the other nodes.

    Run mkdir -p /hpc/packages/ to create the working directory for building Torque.

    Edit the /etc/exports file with the following contents:

    /hpc 192.168.80.0/24(rw,no_root_squash,no_all_squash)
    /home 192.168.80.0/24(rw,no_root_squash,no_all_squash)
    

    Run systemctl start nfs && systemctl enable nfs to start NFS now and enable it at boot.

    Run exportfs -r to make the exports take effect immediately.
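
    As an optional check (not among the original steps), showmount, a standard NFS client utility, can confirm from any node that both directories are exported:

    showmount -e 192.168.80.16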

    On node17 and node18, run:

    mkdir -p /hpc
    mount.nfs 192.168.80.16:/hpc /hpc
    mount.nfs 192.168.80.16:/home /home
    
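
    These mounts do not persist across reboots. A minimal sketch of /etc/fstab entries that would make them permanent, using the same paths as above:

    192.168.80.16:/hpc   /hpc    nfs  defaults  0 0
    192.168.80.16:/home  /home   nfs  defaults  0 0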

    2.2 Deploy the Torque software

    Download:

    wget http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.3.tar.gz
    

    Extract:

    Run tar -zxvf torque-6.1.3.tar.gz -C /hpc/packages to extract into the Torque build working directory.

    Configure the build:

    # 1. Install build dependencies with yum
    yum -y install libxml2-devel boost-devel openssl-devel libtool readline-devel pam-devel hwloc-devel numactl-devel tcl-devel tk-devel
    # yum groupinstall "GNOME Desktop" "Graphical Administrator Tools"  # GUI dependencies, needed only when configuring with --enable-gui
    # 2. Pass options to configure to set up the build
    ./configure \
        --prefix=/hpc/torque/6.1.3 \
        --mandir=/hpc/torque/6.1.3/man \
        --enable-cgroups \
        --enable-syslog \
        --enable-drmaa \
        --enable-gui \
        --with-xauth \
        --with-hwloc \
        --with-pam \
        --with-tcl \
        --with-tk
        # --enable-numa-support  # with this option mom.layout must be edited; the rules are unclear, so it is omitted for now
    # 3. To do: support for MPI, GPUs, Munge authentication, high availability, etc. may be added in a later update
    

    When configure finishes, it prints:

    Building components: server=yes mom=yes clients=yes
                         gui=yes drmaa=yes pam=yes
    PBS Machine type    : linux
    Remote copy         : /usr/bin/scp -rpB
    PBS home            : /var/spool/torque
    Default server      : node13
    
    Unix Domain sockets : 
    Linux cpusets       : no
    Tcl                 : -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
    Tk                  : -L/usr/lib64 -ltk8.5 -lX11 -L/usr/lib64 -ltcl8.5 -ldl -lpthread -lieee -lm
    Authentication      : trqauthd
    
    Ready for 'make'.
    

    Build and install:

    # 1. Build and install
    make -j4 && make install
    # 2. Generate self-extracting install packages (optional)
    # make packages
    

    The make packages step is optional: in this post the working directory sits on the shared filesystem, so it is enough to run make install on each node. For reference, make packages produces:

    [root@node16 torque-6.1.3]# make packages
    Building packages from /hpc/packages/torque-6.1.3/tpackages
    rm -rf /hpc/packages/torque-6.1.3/tpackages
    mkdir /hpc/packages/torque-6.1.3/tpackages
    Building ./torque-package-server-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-mom-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-clients-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-gui-linux-x86_64.sh ...
    Building ./torque-package-pam-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /lib64/security'
    Building ./torque-package-drmaa-linux-x86_64.sh ...
    libtool: install: warning: relinking `libdrmaa.la'
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-devel-linux-x86_64.sh ...
    libtool: install: warning: remember to run `libtool --finish /hpc/torque/6.1.3/lib'
    Building ./torque-package-doc-linux-x86_64.sh ...
    Done.
    
    The package files are self-extracting packages that can be copied
    and executed on your production machines.  Use --help for options.
    
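
    On machines without access to the shared build tree, these self-extracting packages could be copied over and run directly; a sketch for a compute node (each package script describes its options, including --install, via --help):

    # after copying the packages to the target node:
    ./torque-package-mom-linux-x86_64.sh --install
    ./torque-package-clients-linux-x86_64.sh --install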

    Run libtool --finish /hpc/torque/6.1.3/lib

    This step is also optional; make install performs it by default.

    [root@node16 torque-6.1.3]# libtool --finish /hpc/torque/6.1.3/lib
    libtool: finish: PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin:/sbin" ldconfig -n /hpc/torque/6.1.3/lib
    ----------------------------------------------------------------------
    Libraries have been installed in:
       /hpc/torque/6.1.3/lib
    
    If you ever happen to want to link against installed libraries
    in a given directory, LIBDIR, you must either use libtool, and
    specify the full pathname of the library, or use the `-LLIBDIR'
    flag during linking and do at least one of the following:
       - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
         during execution
       - add LIBDIR to the `LD_RUN_PATH' environment variable
         during linking
       - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
       - have your system administrator add LIBDIR to `/etc/ld.so.conf'
    
    See any operating system documentation about shared libraries for
    more information, such as the ld(1) and ld.so(8) manual pages.
    ----------------------------------------------------------------------
    

    Next, set up the environment variables and the service unit files.

    At this point, ls -lrt /usr/lib/systemd/system shows that the following unit files are already installed:

    -rw-r--r--  1 root root 1284 Oct 12 11:17 pbs_server.service
    -rw-r--r--  1 root root  704 Oct 12 11:17 pbs_mom.service
    -rw-r--r--  1 root root  335 Oct 12 11:17 trqauthd.service
    

    The pbs_sched.service unit file is missing; copy it from the /hpc/packages/torque-6.1.3/contrib/systemd directory into the system directory:

    cp /hpc/packages/torque-6.1.3/contrib/systemd/pbs_sched.service /usr/lib/systemd/system

    Running ls -lrt /etc/profile.d shows that torque.sh is already present; simply run source /etc/profile to load it.
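
    For reference, a minimal sketch of what such a profile script typically exports, assuming the install prefix used above (the generated file may differ in detail):

    # /etc/profile.d/torque.sh -- illustrative sketch, not the generated file verbatim
    export PATH=/hpc/torque/6.1.3/bin:/hpc/torque/6.1.3/sbin:$PATH
    export LD_LIBRARY_PATH=/hpc/torque/6.1.3/lib:$LD_LIBRARY_PATH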

    3. Configuration

    3.1 Configure the management node

    3.1.1 Add the PBS admin user

    Here the root user is used.

    Run ./torque.setup root; the script's own comment describes its purpose as: create pbs_server database and default queue.

    [root@node16 torque-6.1.3]# ./torque.setup root
    hostname: node13
    Currently no servers active. Default server will be listed as active server. Error  15133
    Active server name: node13  pbs_server port is: 15001
    trqauthd daemonized - port /tmp/trqauthd-unix
    trqauthd successfully started
    initializing TORQUE (admin: root)
    
    You have selected to start pbs_server in create mode.
    If the server database exists it will be overwritten.
    do you wish to continue y/(n)?
    # enter y
    
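
    To inspect the server and queue configuration that torque.setup just created, qmgr's print command can be used:

    qmgr -c 'print server'
    # prints the full server configuration, including the default
    # "batch" queue and its attributes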

    3.1.2 Start the component authentication service

    The step in 3.1.1 already starts the authentication daemon trqauthd, which can be verified with ps aux | grep trqauthd.

    A later systemctl start trqauthd would then fail, so it is recommended to kill that process first with pkill -f trqauthd and then start it through systemd:

    pkill -f trqauthd
    systemctl start trqauthd
    systemctl enable trqauthd
    

    3.1.3 Start the main server

    Declare the compute nodes: vim /var/spool/torque/server_priv/nodes

    node17 np=4
    node18 np=4
    

    Then run the following commands:

    systemctl status pbs_server
    systemctl start pbs_server   # if this fails, check whether pbs_server is already running; if so, run pkill -f pbs_server and retry
    systemctl enable pbs_server
    

    Run qnodes to view node information.

    If the qnodes command is not found, run source /etc/profile to load the environment variables.

    node17
         state = down
         power_state = Running
         np = 4
         ntype = cluster
         mom_service_port = 15002
         mom_manager_port = 15003
         total_sockets = 0
         total_numa_nodes = 0
         total_cores = 0
         total_threads = 0
         dedicated_sockets = 0
         dedicated_numa_nodes = 0
         dedicated_cores = 0
         dedicated_threads = 0
    
    node18
         state = down
         power_state = Running
         np = 4
         ntype = cluster
         mom_service_port = 15002
         mom_manager_port = 15003
         total_sockets = 0
         total_numa_nodes = 0
         total_cores = 0
         total_threads = 0
         dedicated_sockets = 0
         dedicated_numa_nodes = 0
         dedicated_cores = 0
         dedicated_threads = 0
    # node17 and node18 show state = down because pbs_mom is not running on them yet
    

    3.1.4 Start the scheduler service

    On node16, also run systemctl start pbs_sched; otherwise every submitted job stays stuck in the Q (queued) state.

    Enable it at boot with systemctl enable pbs_sched.

    3.2 Configure the compute nodes

    Section 3.1 completed the deployment of the management node node16, including:

    • installing the dependencies with yum
    • extracting the source, configuring the build, building and installing
    • configuring the admin user
    • editing the configuration files
    • starting the trqauthd, pbs_server, and pbs_sched services

    The compute nodes need the following:

    • installing the dependencies with yum
    • configuring the management-node information
    • running the install packages, or make install
    • starting the trqauthd and pbs_mom services

    Since all of the work was done under the shared directory, it is enough to run make install on node17 and node18, as sketched below.
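
    A minimal sketch of the compute-node steps, assuming the shared build tree and hostnames used above (server_name and mom_priv/config are the Torque defaults under /var/spool/torque):

    # on node17 and node18
    cd /hpc/packages/torque-6.1.3
    make install
    # point the node at the management node
    echo node16 > /var/spool/torque/server_name
    echo '$pbsserver node16' > /var/spool/torque/mom_priv/config
    # load the environment and start the services
    source /etc/profile
    systemctl start trqauthd && systemctl enable trqauthd
    systemctl start pbs_mom && systemctl enable pbs_mom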

    4. Usage

    4.1 View and activate the queue

    Running torque.setup in 3.1.1 adds a default queue named batch and sets some basic attributes on it.

    At this point, run qmgr -c 'active queue batch' before jobs can be submitted to this queue.
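
    To check the queue's attributes (for example, that enabled and started are both True), the queue can be listed with qmgr:

    qmgr -c 'list queue batch'
    # shows queue_type, enabled, started, resources_default, and so on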

    Jobs are submitted on the management node; submitting on a compute node fails:

    [liwl01@node18 ~]$ echo "sleep 120"|qsub
    qsub: submit error (Bad UID for job execution MSG=ruserok failed validating liwl01/liwl01 from node18)
    

    Submit a job on node16:

    [liwl01@node16 ~]$ echo "sleep 300"|qsub
    1.node16
    
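
    Beyond piping a command into qsub, jobs are usually submitted as scripts; a minimal sketch of a PBS job script using the defaults above (the #PBS lines are standard PBS directives):

    #!/bin/bash
    #PBS -N sleep_test           # job name
    #PBS -q batch                # target queue
    #PBS -l nodes=1:ppn=1        # one core on one node
    #PBS -l walltime=00:10:00    # wall-clock limit
    cd $PBS_O_WORKDIR            # run from the submission directory
    sleep 120

    Saved as job.sh, it would be submitted with qsub job.sh and monitored with qstat.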

    While the job is running, the S column shows job state R (running):

    [liwl01@node16 ~]$ qstat -a -n
    
    node16: 
                                                                                      Req'd       Req'd       Elap
    Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
    ----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
    1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 R  00:04:17
       node17/0
    

    After the job finishes, the S column shows state C (completed):

    [liwl01@node16 ~]$ qstat -a -n
    
    node16: 
                                                                                      Req'd       Req'd       Elap
    Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory      Time    S   Time
    ----------------------- ----------- -------- ---------------- ------ ----- ------ --------- --------- - ---------
    1.node16                liwl01      batch    STDIN             20038     1      1       --   01:00:00 C       -- 
       node17/0
    

    And the output of qnodes:

    [liwl01@node16 ~]$ qnodes 
    node17
         state = free
         power_state = Running
         np = 4
         ntype = cluster
         jobs = 0/1.node16
         status = opsys=linux,uname=Linux node17 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64,sessions=20038,nsessions=1,nusers=1,idletime=2159,totmem=3879980kb,availmem=3117784kb,physmem=3879980kb,ncpus=4,loadave=0.00,gres=,netload=1039583908,state=free,varattr= ,cpuclock=Fixed,macaddr=00:00:00:80:00:17,version=6.1.3,rectime=1634101823,jobs=1.node16
         mom_service_port = 15002
         mom_manager_port = 15003
         total_sockets = 1
         total_numa_nodes = 1
         total_cores = 4
         total_threads = 4
         dedicated_sockets = 0
         dedicated_numa_nodes = 0
         dedicated_cores = 0
         dedicated_threads = 1
    

    5. Maintenance

    To be updated later.

    [ Questions and discussion welcome! Email: yunweinote@126.com ]