Monday, November 9, 2020

Job Schedulers for Clusters

The software systems responsible for making a cluster of computers work together as one resource are called Distributed Resource Management Systems (DRMS), or informally job schedulers. The most commonly used ones are SGE (Sun Grid Engine), PBS/TORQUE, and SLURM.

PBS command vs SGE commands

http://www.softpanorama.org/HPC/PBS_and_derivatives/Reference/pbs_command_vs_sge_commands.shtml

SGE to SLURM conversion
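Since the command names differ between schedulers, a small lookup table helps when translating job scripts. A bash sketch (requires bash 4+ for associative arrays) mapping a few common SGE/PBS commands to their SLURM equivalents:

```shell
#!/usr/bin/env bash
# Common SGE/PBS commands and their SLURM equivalents,
# stored as a bash associative array for quick reference.
declare -A to_slurm=(
  [qsub]="sbatch"     # submit a batch job script
  [qstat]="squeue"    # list queued/running jobs
  [qdel]="scancel"    # cancel a job by ID
  [qhost]="sinfo"     # show node/partition status
)

for cmd in qsub qstat qdel qhost; do
  printf '%-6s -> %s\n' "$cmd" "${to_slurm[$cmd]}"
done
```

The options also differ (e.g. SGE's `-pe` vs SLURM's `--ntasks`), so the table above only covers the top-level commands; see the conversion links for the full mapping.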

I. Sun Grid Engine installation on CentOS Server

http://biowiki.org/wiki/index.php/Sun_Grid_Engine

STEP 0: CREATE THE sgeadmin USER
Create a user account named sgeadmin on the head node and on all the exec nodes, with the group name also sgeadmin (the exact group name is probably not important, as long as it is the same on all nodes). Make sure the user IDs and group IDs for this account are identical across all nodes (this consistency really is important).
## add the group first
sudo groupadd sgeadmin
## add the user with sgeadmin as its primary group
## (note: useradd -p expects an already-encrypted password hash,
##  so set the password separately with passwd)
sudo useradd -m -g sgeadmin sgeadmin
sudo passwd sgeadmin

STEP 1: PREPARE THE FILES
Download https://arc.liv.ac.uk/trac/SGE/
sge-8.0.0a-common.tar.gz   and  sge-8.0.0a-bin-lx-amd64.tar.gz
We will do a local installation on each node, that is, each node running SGE will have its own copy of the SGE binaries and its own local spool directory. This minimizes NFS traffic, since NFS will probably already be used quite heavily for writing SGE job output to the RAID node and for other things.
Use the same $SGE_ROOT (/opt/sge) on each node.


STEP 2: PREPARE THE MASTER/SUBMIT/ADMINISTRATION HOST (master node)
read README.BUILD
# Install the libhwloc-dev package:
change to root: sudo su -  (or prefix each command with sudo)
sudo apt-get update && sudo apt-get upgrade -y
sudo apt-get install -y libhwloc-dev

2.1. Unpack the binaries and build:
change to root, then untar the 2 files into /opt/sge:
sudo mkdir -p /opt/sge
tar xvf sge-8.0.0a-common.tar.gz --directory /opt/sge
tar xvf sge-8.0.0a-bin-lx-amd64.tar.gz --directory /opt/sge

cd /home/canlab/wSourceCode/sge-8.1.9/source
sh scripts/bootstrap.sh -no-java -no-jni 
./aimk -no-java -no-jni 


2.2 The Configuration File: SGE provides automated installation scripts that read the options set in a configuration file and perform the installation with them. We are going to use a configuration file based on the template in wSourceCode/sge-8.1.9/source/dist/util/install_modules/inst_template.conf.
Make a copy of the template, fill out the options, and save it as tha_configuration.conf:
SGE_ROOT="/opt/sge"
SGE_JMX_SSL_CLIENT="false"
CELL_NAME="default"
ADMIN_USER=canlab
QMASTER_SPOOL_DIR=$SGE_ROOT/$CELL_NAME/spool/qmaster
EXECD_SPOOL_DIR=$SGE_ROOT/$CELL_NAME/spool
ADMIN_HOST_LIST="canHead"
SUBMIT_HOST_LIST="canHead"
EXEC_HOST_LIST="canHead"
EXECD_SPOOL_DIR_LOCAL="$SGE_ROOT/$CELL_NAME/spool/execd"
ADMIN_MAIL="none"


Install:
Log into the head node as root and add SGE_QMASTER_PORT and SGE_EXECD_PORT (two different ports, matching the settings in the configuration file) to your /etc/services file:
 # SUN GRID ENGINE
  sge_qmaster	  6444/tcp	 # for Sun Grid Engine (SGE) qmaster daemon
  sge_execd	  6445/tcp	 # for Sun Grid Engine (SGE) exec daemon
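Appending these entries can be scripted so that re-running the setup doesn't duplicate lines. A sketch that writes to a local copy instead of the real /etc/services (on an actual head node, point SERVICES at /etc/services and run as root):

```shell
#!/usr/bin/env bash
# Idempotently register the SGE service ports.
SERVICES=./services.local   # use /etc/services on a real node
touch "$SERVICES"

add_service() {
  local name=$1 port=$2
  # only append if the service name is not already registered
  grep -q "^${name}[[:space:]]" "$SERVICES" || \
    printf '%s\t%s/tcp\n' "$name" "$port" >> "$SERVICES"
}

add_service sge_qmaster 6444
add_service sge_execd   6445
add_service sge_qmaster 6444   # second call is a no-op
```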
execute the inst_sge script on the head node with the parameters -m (install Master Host, which is also the implied Submit and Administration Host), -x (install Execution Hosts), and -auto (read settings from the configuration file). In our case, this will be:

export SGE_ROOT=/opt/sge
cd /opt/sge
./inst_sge -m -x -auto /opt/sge/util/install_modules/tha_configuration.conf






II. Sun Grid Engine installation on Ubuntu Server

II.1. Try this for SGE: https://tkainrad.dev/posts/copy-paste-ready-instructions-to-set-up-1-node-clusters/
* Gain root permissions on Ubuntu: sudo -i (or just prefix commands with sudo)
* Install Dependencies:
sudo apt-get update -y \
&& sudo apt-get install -y sudo bsd-mailx tcsh db5.3-util libhwloc5 libmunge2 libxm4 libjemalloc1 xterm openjdk-8-jre-headless \
&& sudo apt-get clean \
&& sudo rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

1. create a new folder & download source-code:
sudo mkdir -p /opt/sge/installfolder
export INSTALLFOLDER=/opt/sge/installfolder
##--
cd $INSTALLFOLDER 
sudo wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-common_8.1.9_all.deb
sudo wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge-doc_8.1.9_all.deb
sudo wget https://arc.liv.ac.uk/downloads/SGE/releases/8.1.9/sge_8.1.9_amd64.deb
sudo dpkg -i --force-all  ./*.deb

2. Setup files:
download the following 4 files (sge_init.sh, sge_auto_install.conf, sge_hostgrp.conf, sge_exec_host.conf) and place them into /opt/sge/installfolder as well:
sudo wget https://tkainrad.dev/other/sge_init.sh

These scripts and configuration files perform the setup automatically.
# Edit sge_auto_install.conf:
SGE_ROOT="/opt/sge"
SGE_CLUSTER_NAME="docker-sge"
CELL_NAME="default"
# sge_init.sh sets the master host and restarts the daemons
# (note the $(...) command substitution, not single quotes):
export SGE_HOST=$(cat /opt/sge/default/common/act_qmaster)
/etc/init.d/sgemaster.docker-sge restart
/etc/init.d/sgeexecd.docker-sge restart

#After the download, we need to set some environment variables in the current shell:
export SGE_ROOT=/opt/sge 
export SGE_CELL=default

#We also need to set a new profile.d config via
sudo ln -s $SGE_ROOT/$SGE_CELL/common/settings.sh /etc/profile.d/sge_settings.sh
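The symlink works because login shells source every file in /etc/profile.d, so SGE's environment is set automatically at login. The same idea with throwaway local paths (fake profile.d directory, so it can be tried without root):

```shell
#!/usr/bin/env bash
# Demonstrate the profile.d trick: a settings.sh that exports the
# SGE environment, linked into a (fake) profile.d directory.
mkdir -p ./sge/default/common ./profile.d
cat > ./sge/default/common/settings.sh <<'EOF'
export SGE_ROOT=/opt/sge
export SGE_CELL=default
EOF
ln -sf "$PWD/sge/default/common/settings.sh" ./profile.d/sge_settings.sh

# what a login shell effectively does with every file in /etc/profile.d:
. ./profile.d/sge_settings.sh
echo "SGE_ROOT=$SGE_ROOT SGE_CELL=$SGE_CELL"
# -> SGE_ROOT=/opt/sge SGE_CELL=default
```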

3. Install
# execute the following to install SGE and perform setup operations:
useradd -r -m -U -G sudo -d /home/sgeuser -s /bin/bash -c "Docker SGE user" sgeuser
cd $SGE_ROOT 
##--
sudo ./inst_sge -m -x -s -auto $INSTALLFOLDER/sge_auto_install.conf \
&& sleep 10 \
&& /etc/init.d/sgemaster.docker-sge restart \
&& /etc/init.d/sgeexecd.docker-sge restart \
&& sed -i "s/HOSTNAME/`hostname`/" $INSTALLFOLDER/sge_exec_host.conf \
&& sed -i "s/HOSTNAME/`hostname`/" $INSTALLFOLDER/sge_hostgrp.conf \
&& /opt/sge/bin/lx-amd64/qconf -Me $INSTALLFOLDER/sge_exec_host.conf 
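The two sed calls above replace a HOSTNAME placeholder in the downloaded config templates with the machine's real hostname before qconf applies them. The same substitution on a throwaway file (the real sge_exec_host.conf is more elaborate):

```shell
#!/usr/bin/env bash
# Reproduce the placeholder substitution from the install pipeline
# on a scratch copy.
printf 'hostname HOSTNAME\n' > ./exec_host.sample
sed -i "s/HOSTNAME/$(hostname)/" ./exec_host.sample
cat ./exec_host.sample    # HOSTNAME is now the actual machine name
```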


## Note: to reinstall, we need to delete these init scripts in /etc/init.d and remove the cell directory:
sudo rm -r -f sgemaster.docker-sge
sudo rm -r -f sgeexecd.docker-sge
sudo rm -r -f /opt/sge/default 

4. Add users
# We still need to add users to the sgeusers group, which was defined in the sge_hostgrp.conf file you just applied. Only users in this group are allowed to submit jobs. Therefore, we run the following:
sudo /opt/sge/bin/lx-amd64/qconf -au <USER> sgeusers
## for example:
sudo /opt/sge/bin/lx-amd64/qconf -au canlab sgeusers
sudo /opt/sge/bin/lx-amd64/qconf -au hung sgeusers
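Adding many users one-by-one is error-prone, so a loop helps. A dry-run sketch (the user names are the ones from this post; it only prints the qconf commands — pipe each line to the shell, or call qconf directly in the loop, to actually run them):

```shell
#!/usr/bin/env bash
# Build the qconf calls for several users in one loop (dry run).
QCONF=/opt/sge/bin/lx-amd64/qconf
users=(canlab hung)

cmds=()
for u in "${users[@]}"; do
  cmds+=("sudo $QCONF -au $u sgeusers")
done
printf '%s\n' "${cmds[@]}"   # print instead of executing
```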



II.2 Another way:

https://www.socher.org/index.php/Main/HowToInstallSunGridEngineOnUbuntu
https://peteris.rocks/blog/sun-grid-engine-installation-on-ubuntu-server/

1. On Master Node
(install Master Host, which is also the implied Submit and Administration Host)
(install Execution Hosts)
https://gist.github.com/asadharis/9d14da97d9ad1f8eccc36dc14390e4e0
git clone https://gist.github.com/9d14da97d9ad1f8eccc36dc14390e4e0.git sgeSetup/
cd sgeSetup
sudo chmod +x install_sge.sh loop.sh sleep.sh
./install_sge.sh
./loop.sh

2. On worker Nodes


3. Uninstall SGE
https://howtoinstall.co/en/ubuntu/xenial/gridengine-master?action=remove
sudo apt-get autoremove --purge gridengine-master


II.3 Configure SGE

https://southgreenplatform.github.io/trainings/hpc/sgeinstallation/



B. PBS / Torque

# RHEL, CentOS, and Scientific Linux: yum install
# Ubuntu: sudo apt-get install

I. PBS on Ubuntu

http://docs.adaptivecomputing.com/torque/5-0-0/Content/topics/torque/1-installConfig/installing.htm

https://pmateusz.github.io/linux/torque/2017/03/25/torque-installation-on-ubuntu.html

http://docs.adaptivecomputing.com/torque/5-1-3/Content/topics/hpcSuiteInstall/manual/1-installing/installingTorque.htm

https://tkainrad.dev/posts/copy-paste-ready-instructions-to-set-up-1-node-clusters/#pbs--torque

1. Install the relevant packages required to run TORQUE 5.1.1:
sudo apt-get install libboost-all-dev libssl-dev libxml2-dev tcl8.6-dev tk8.6-dev libhwloc-dev cpuset 

2. Download & install TORQUE
https://ubuntuforums.org/showthread.php?t=289767
## from github  (in use)
git clone https://github.com/adaptivecomputing/torque.git -b 6.1.1 torque-6.1.1 
cd torque-6.1.1
./autogen.sh

## alternative: without github
wget --quiet http://wpfilebase.s3.amazonaws.com/torque/torque-6.1.0.tar.gz 
tar -xvzf torque-6.1.0.tar.gz
#############

./configure --prefix=/opt/torque --disable-werror