From zero to a scalable, container-friendly cloud-based SLURM cluster in an afternoon

Aug 19, 2021

Become an HPC sysadmin using Terraform and Google Cloud Platform

Not so long ago, our on-premises cluster suffered a period of instability caused by multiple hardware and software issues happening at the same time. I had an urgent simulation to run, which I had already written to use SLURM, our job scheduler. Fortunately, my simulation did not require any sensitive data, only publicly available data from the 1000 Genomes project. That opened the door to running it on the cloud (I usually work with medical data from public research projects, which in Europe is hard to get onto the cloud).

As (un)expected, there isn't a single consistent and fully correct piece of documentation for doing this. Google hosts a walkthrough here and a community-written codelab here, but neither is both fully correct and exhaustive.

Prerequisites

You will need to create a project, activate Cloud Shell, make sure billing is enabled for your project, and enable APIs as described here. You'll also need to choose a zone where the data centers hosting your cluster will be physically located. The choice of region (e.g. us-west1) and zone (e.g. us-west1-a) is determined by privacy/data-localisation concerns and price. Prices of all resources vary per region (not zone) and can be found here. As a rule of thumb, Asian and European locations are more expensive than US ones, and you should choose the region where the most expensive and most heavily used type of instance (your compute nodes) is cheapest. Within a region, the choice of a specific zone is not important; available zones can be listed in Cloud Shell (see below) using gcloud compute zones list.
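
For example, to see the available regions and then the zones within a candidate region (us-west1 is just an illustration, not a recommendation):

## list regions, then the zones of one candidate region
gcloud compute regions list
gcloud compute zones list --filter="region:us-west1"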

Installing the gcloud CLI

All of the major cloud providers offer an in-browser SSH shell that lets you monitor and orchestrate your cloud resources. Browser-based shells are impractical for real work, so thankfully command-line interfaces are available as well. Here's how to install the Google Cloud SDK (which provides the gcloud command) on your local distribution; I ran this on WSL running Ubuntu 20.04.

echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
sudo apt-get install apt-transport-https ca-certificates gnupg
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
gcloud init

The last command will attempt to open a browser window for you to authenticate; it will also print a link in case you don't have a browser installed. Then, just run:

gcloud cloud-shell ssh --authorize-session

to connect to a Cloud Shell machine. The last argument is optional; it makes the session carry the credentials needed to actually administer your account, create instances, and so on.

Set the environment to your project using:

gcloud config set project [PROJECT_ID]
export CLUSTER_NAME="cluster-name"
export CLUSTER_ZONE="cluster-zone"

Building and configuring a cluster using Terraform

For a minimal SLURM cluster, you typically need:

  • a login node
  • a controller node
  • a worker node

For any meaningful workload, you'll need more than one worker node. You can also add more specialised instances, such as a Lustre server for fast parallel storage.

Rather than defining each of these instances separately, we will use Terraform, an orchestration tool that lets you describe the machines and the links between them in text files. Terraform then uses the Google Cloud API to instantiate the resources you need. This also makes your cluster elastic: it reacts to demand and spins up additional workers as needed.

I used SLURM-GCP, a set of Terraform scripts written by SchedMD, the company behind SLURM.

## make sure you are using Cloud Shell, don't run this locally.
git clone https://github.com/SchedMD/slurm-gcp && cd slurm-gcp
cd tf/examples/singularity && cp basic.tfvars.example basic.tfvars

basic.tfvars is the file we will use to describe our cluster. We use Singularity as the container solution for our workflows, and thankfully slurm-gcp has an example deployment that saves you the trouble of installing it on all the nodes. Let's open slurm-gcp/tf/examples/singularity/basic.tfvars.

Editing Terraform variables

Environment

Now edit that file using your favourite editor (I use nano). The first three lines identify your project. Note that Terraform will not expand shell variables inside a .tfvars file, so treat the values below as placeholders to be replaced with the literal values for your project:

cluster_name = "${CLUSTER_NAME}"
project      = "$(gcloud config get-value core/project)"
zone         = "${CLUSTER_ZONE}"
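
One way to fill them in from the environment variables set earlier is a quick sed pass. This is only a sketch; the patterns assume the default layout of basic.tfvars, so adjust them if your copy differs:

## substitute the placeholders with literal values (adjust patterns to your file)
sed -i "s|^cluster_name .*|cluster_name = \"${CLUSTER_NAME}\"|" basic.tfvars
sed -i "s|^project .*|project      = \"$(gcloud config get-value core/project)\"|" basic.tfvars
sed -i "s|^zone .*|zone         = \"${CLUSTER_ZONE}\"|" basic.tfvars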

Controller node

The next few lines define the controller node, where the SLURM master monitoring process will run. In most cases controller_machine_type = "n1-standard-2" should be enough, unless you plan on spinning up a really large cluster. Set the remaining variables in this section to:

controller_image        = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-7-hpc-centos-7"
controller_disk_type    = "pd-balanced"
controller_disk_size_gb = 438

At the time of writing, the image above was the most recent version of the controller image (the OS with SLURM preinstalled); the current version is documented on the project homepage.
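
To see which images SchedMD currently publishes, you can list the images in their public image project, for instance:

## list the SLURM images published in SchedMD's public image project
gcloud compute images list --project schedmd-slurm-public --no-standard-images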

Login node

It is best practice to have a second node as a login node, which users ssh into to launch their jobs. You can configure it the same way as the controller node, with a smaller and slower disk, since you don't expect it to do any heavy work:

login_machine_type = "n1-standard-2"
login_image        = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-7-hpc-centos-7"
login_disk_type    = "pd-standard"
login_disk_size_gb = 20

Cluster

We will define only a single partition (a group of nodes), but you can easily define more if you have several types of jobs (e.g. GPU-based or memory-intensive).

partitions = [
{ name                 = "normal"
  machine_type         = "n1-highmem-16"
  static_node_count    = 0
  max_node_count       = 2
  zone                 = "us-west1-a"
  image                = "projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-7-hpc-centos-7"
  image_hyperthreads   = false
  compute_disk_type    = "pd-standard"
  compute_disk_size_gb = 21
  compute_labels       = {}
  cpu_platform         = null
  gpu_count            = 0
  gpu_type             = null
  network_storage      = []
  preemptible_bursting = false
  vpc_subnet           = null
  exclusive            = false
  enable_placement     = false
  regional_capacity    = false
  regional_policy      = {}
  instance_template    = null
  }
]

Deploying

Initialise Terraform, then apply the configuration:

terraform init
terraform apply -var-file basic.tfvars

All instances, as well as the network linking them, will now be created by Terraform through API calls.
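
Once the apply completes, you can check that the controller and login nodes came up (the compute nodes are only created on demand when jobs are queued), for example with:

## verify that the static instances were created
gcloud compute instances list --filter="name ~ ${CLUSTER_NAME}"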

Logging in & testing

The gcloud suite is available on your Cloud Shell machine, and you can also install it locally (as above). The controller node is named after your cluster (here, simcluster-controller); just ssh into it:

gcloud compute ssh simcluster-controller --zone us-west1-a

Let’s try to pull a container:

module add singularity
singularity pull library://hmgu-itg/default/burden_testing:latest
srun --pty singularity shell burden_testing_latest.sif ## this will take some time, since the node first has to spin up
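
Batch jobs work as usual. Below is a minimal sketch of a job script that runs a command inside the container pulled above; the script name and resource requests are illustrative, and the partition matches the "normal" partition defined earlier:

#!/bin/bash
## minimal example job script; resource requests are illustrative
#SBATCH --job-name=container-test
#SBATCH --partition=normal
#SBATCH --cpus-per-task=2
#SBATCH --time=00:30:00

module add singularity
singularity exec burden_testing_latest.sif echo "hello from inside the container"

Save it as, say, test.sh, submit it with sbatch test.sh, and watch the compute node spin up with squeue and sinfo.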

Configuring storage

So far we have just configured machines, but we need some form of shared storage. This is where it gets tricky.

Mounting a bucket

A bucket is a collection of files and folders stored in the cloud. You can create buckets here, after which it's very easy to mount one on any node:

mkdir data && gcsfuse --implicit-dirs $YOUR_BUCKET_NAME data

To make this mount permanent on all nodes, change basic.tfvars as follows (remote_mount is the name of your bucket, local_mount the path where it will be mounted):

network_storage = [{
  server_ip     = "gcs"
  remote_mount  = "shared-nfs-treps"
  local_mount   = "/data"
  fs_type       = "gcsfuse"
  mount_options = "rw,_netdev,user,file_mode=664,dir_mode=775,allow_other"
}]

To mount this space on the compute and login nodes, change the scopes section as shown below, and make sure that your default service account has the "Storage Admin" IAM role (assigned at this link):

compute_node_service_account = "default"
compute_node_scopes          = [
  "https://www.googleapis.com/auth/monitoring.write",
  "https://www.googleapis.com/auth/logging.write",
  "https://www.googleapis.com/auth/devstorage.full_control"
]

login_node_service_account = "default"
login_node_scopes          = [
  "https://www.googleapis.com/auth/monitoring.write",
  "https://www.googleapis.com/auth/logging.write",
  "https://www.googleapis.com/auth/devstorage.full_control"
]
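
To double-check that the default compute service account actually holds a storage role, you can inspect the project's IAM policy, for example with something like:

## list members holding the Storage Admin role in the current project
gcloud projects get-iam-policy $(gcloud config get-value core/project) \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/storage.admin" \
  --format="value(bindings.members)"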

Unfortunately, buckets are not really made for heavy I/O. They are more appropriate for storing final output files and for downloading input data.

Using NFS for shared storage

By default, the fstab on the nodes contains simcluster-controller:/home /home nfs defaults,hard,intr,_netdev 0 0, meaning that any subdirectory of ~ will be accessible read/write via NFS. The issue is that a single node (the controller) serves the share, so it becomes a bottleneck if your processes are I/O-intensive. You can easily mount other shares from the controller by adding entries to network_storage, although the utility of that is limited. You can of course also add external NFS storage by pointing server_ip at the external server.
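
You can verify what a node actually mounts once it is up, for example from the login node:

## on a login or compute node: show the NFS share exported by the controller
grep nfs /etc/fstab
df -h /home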

Other solutions

Google Filestore

Google has its own turnkey storage-server solution. Using Filestore, you can spin up managed storage that can be mounted onto the SLURM cluster nodes via NFS. The homepage is here, and the quickstart is here.
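
Once a Filestore instance exists, mounting it on a node is plain NFS; here is a sketch where the IP address and share name are placeholders for whatever your instance reports:

## mount a Filestore share over NFS (10.0.0.2 and /vol1 are placeholders)
sudo mkdir -p /mnt/filestore
sudo mount -t nfs 10.0.0.2:/vol1 /mnt/filestore

To have every node mount it at boot, you can presumably also add a network_storage entry in basic.tfvars with the Filestore IP as server_ip and fs_type set to "nfs", analogous to the bucket example above.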

Lustre

Lustre is a professional solution for highly parallel, I/O-intensive applications. Storage is provided through a dedicated cluster, so this is more involved to deploy. There is a community-developed script that uses Deployment Manager (an alternative to Terraform) to create a Lustre cluster here. The two clusters have to be on the same VPC network to be able to communicate (as per this thread), so you'll need to create the Lustre cluster on a particular network, then create the SLURM cluster in the same VPC.

There is also a commercial solution by Fluid Numerics that creates a complete system with both Lustre and Filestore enabled, available here. It incurs a usage fee of $0.01 USD/vCPU/hour and $0.09 USD/GPU/hour, but the code is open source and can be freely used for inspiration.

Understanding the limits of the free plan

Although the free plan is supposedly designed so that you can gain exposure to all GCP features, it is actually severely limited. When you start out, you may already have an idea of a project that you want to run on the cloud, and a major concern is how much it will cost to run, and using which resources. Unfortunately, the free plan makes that estimate very hard to produce.

That's because you will be unable to use the resources that make the most sense, or are priced best, for your project, and therefore cannot get a feel for how long you would need them to run your jobs. This is especially true for HPC-style tasks.

For example, you may be interested in using spot/preemptible instances to reduce costs, or you may have large memory, CPU or storage needs. What you can and cannot use is determined by quotas. This is probably both the most important and least well documented feature of GCP and of most other cloud providers. There is one global quota for every broad class of resource you may use (e.g. total number of CPUs), and one for every specific subclass (e.g. CPUs on E-type machines). To complicate things further, there are also region-specific quotas. The idea is that these quotas prevent you from accidentally "ordering" 1,000 high-memory CPUs and getting an unexpected, eye-watering bill.

Before running any big project, you should estimate the amount of resources required and request an increase in the corresponding quotas. The trouble is that you cannot request a quota change on the free plan, so you are limited to the defaults that are set when you create your account.

These defaults are not documented anywhere, so you tend to discover them through trial and error. I've managed to figure out the following (the commands after the list show how to inspect your own quotas), but that's not nearly all of them:

  • 20 E-type CPUs;
  • 0 M-type CPUs (high-memory nodes);
  • 32 CPUs globally;
  • 500 GB of storage;
  • no preemptible instances.
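
You can inspect your current limits (and usage) without waiting to hit them; the quotas section of the output of the following commands lists each metric with its limit:

## per-project quotas
gcloud compute project-info describe --project $(gcloud config get-value core/project)
## per-region quotas, e.g. for us-west1
gcloud compute regions describe us-west1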

Needless to say, you can't really test much in terms of realistic HPC jobs with this. There are many features I couldn't test, notably the one I was most interested in: preemptible instances. Most importantly, I couldn't test a full, realistic application scenario by running my simulation in near-real conditions.

Conclusion

Once you’ve got everything figured out, slurm-gcp provides a nifty solution to get an emergency SLURM cluster up and running in a few hours.

It is, however, much harder than just pressing a button and getting your nodes. I was actually quite surprised by how much effort it took to get going. This is not specific to GCP; I've had similar experiences with Azure CycleCloud and Amazon ParallelCluster.

Despite their advantages, these are still complex software tools with their own ecosystems, lingo and quirks, each with its own learning curve.

This is not helped by the crippling limitations of the trial plans, which prevent you from really getting to grips with how the systems work.

However, they do allow you to play around. slurm-gcp is just the tip of the iceberg in terms of what these solutions allow, in that it essentially ports "old" technology (HPC/SLURM) onto new hardware offered as a service. Another interesting area of potential relevance to scientific workloads is hybrid Kubernetes+HPC clusters, which combine the benefits of job scheduling and orchestration systems.