Now that we have set up the security groups to allow communication, we can proceed to change the Slurm configuration.
First, let's find out the hostname of the headnode belonging to the onprem cluster. You should be logged into the Cloud9 instance.
pcluster ssh -n onprem -i ~/.ssh/ssh-key.pem -r ${AWS_REGION} hostname
ip-172-31-30-17
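If you prefer to capture this value rather than copy it by hand, you can use command substitution from the Cloud9 shell (an optional convenience; the ONPREM_HEADNODE variable name is just an example, and you may need to trim extra lines if your SSH session prints a banner):
ONPREM_HEADNODE=$(pcluster ssh -n onprem -i ~/.ssh/ssh-key.pem -r ${AWS_REGION} hostname)
echo ${ONPREM_HEADNODE}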
Keep a note of the name returned; you will need it in the next step. Don't reuse the example value above, as your hostname will differ. Log in to the cloud cluster for the next few steps.
pcluster ssh -n cloud -i ~/.ssh/ssh-key.pem -r ${AWS_REGION}
sudo vi /opt/slurm/etc/slurm_parallelcluster.conf
Edit the file and change the AccountingStorageHost value to the onprem headnode hostname you retrieved in the previous step. The IP addresses and names in the example below will likely differ from your cluster; AccountingStorageHost is the only value you need to change.
SlurmctldHost=ip-172-31-34-123(172.31.34.123)
SuspendTime=120
ResumeTimeout=1800
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ip-172-31-30-17
AccountingStoragePort=6819
AccountingStorageUser=slurm
JobAcctGatherType=jobacct_gather/cgroup
include /opt/slurm/etc/pcluster/slurm_parallelcluster_cloudq_partition.conf
SuspendExcNodes=cloudq-st-c6i-[1-1]
Once the edit is done, save the file (with :wq if you used vi) and proceed.
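Alternatively, if you prefer to make the change without an interactive editor, a sed one-liner such as the following achieves the same edit (substitute your own onprem hostname for the example value):
sudo sed -i 's/^AccountingStorageHost=.*/AccountingStorageHost=ip-172-31-30-17/' /opt/slurm/etc/slurm_parallelcluster.conf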
Now we need to restart the slurmctld daemon so it re-reads the configuration.
sudo systemctl restart slurmctld
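Before moving on, you can check that the daemon came back up cleanly (the log path shown is the usual ParallelCluster location, but may differ on your system):
sudo systemctl status slurmctld
sudo tail /var/log/slurmctld.log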
If all goes well, you should now see both clusters when you run the show clusters command:
sacctmgr show clusters
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
cloud 172.31.34.123 6820 9728 1 normal
onprem 172.31.30.17 6820 9728 1 normal
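If the full table is hard to read in a narrow terminal, sacctmgr accepts a format= option to display only the columns you care about (optional):
sacctmgr show clusters format=Cluster,ControlHost,ControlPort,RPC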
Now we need to set up federation so that the clusters work together. A single command is all that is needed.
sacctmgr add federation fedone clusters=onprem,cloud
Adding Federation(s)
fedone
Settings
Cluster = onprem
Cluster = cloud
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
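If you are scripting this step, sacctmgr's -i (immediate) flag commits the change without the interactive prompt:
sacctmgr -i add federation fedone clusters=onprem,cloud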
Now let's confirm the federation is working:
sacctmgr show federation
Federation Cluster ID Features FedState
---------- ---------- -- -------------------- ------------
fedone cloud 2 ACTIVE
fedone onprem 1 ACTIVE
We now have a federated Slurm setup, which means we can submit jobs to either cluster from either headnode.
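As a quick illustrative check (the job itself is trivial and only meant to exercise the federation), you can submit a job from whichever headnode you are on and then list work across both clusters:
sbatch --wrap "hostname"               # submit a trivial test job to the local cluster
squeue --federation                    # list jobs across all clusters in the federation
sbatch -M onprem --wrap "hostname"     # optionally target a specific member cluster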
However, aside from being linked via Slurm, the two clusters are not otherwise combined. In production you would need to ensure a consistent mapping of user IDs between the clusters and arrange a means of exchanging data between them.
In this lab we will work with a single user to keep things simple.
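Even with a single user, you can sanity-check that the user maps to the same numeric IDs on both headnodes, since mismatched UIDs are a common source of accounting and file-ownership problems (the ec2-user name assumes the default ParallelCluster Amazon Linux image; substitute your own user):
id ec2-user   # run on each headnode and compare the uid and gid values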