Create a Grid Engine cluster with Preemptible VM workers

The properly rendered version of this document can be found at Read The Docs.

If you are reading this on github, you should instead click here.

With Compute Engine Preemptible Virtual Machines you can create and run VMs in the cloud at a much lower price than normal instances. The trade-off for the lower price is that individual instances will run for at most 24 hours.

This trade-off is often a very good fit for distributed batch compute jobs, such as a large Grid Engine job consisting of many small stateless tasks. When a preemptible VM is terminated, only the work of the current task running on it is lost, and the lost task can be requeued for execution on another worker node. The preempted VM can also often be replaced to bring the cluster back to full strength.

This document builds on the instructions to Create a Grid Engine cluster on Compute Engine to create a Grid Engine cluster of preemptible VMs.

Note

The instructions presented here are guidelines that have been used to create clusters up to 100 nodes. However when preemption rates are high, Elasticluster’s re-provisioning of clusters (via Ansible) often converges too slowly due to repeated failures.

For best success with the instructions here, it is recommended to keep cluster sizes to 20 compute nodes or fewer. For larger clusters, use regular (non-preemptible) virtual machines.

Toolset

To run a Grid Engine workload on preemptible VMs, the instructions here employ three tools:

Steps

Setting up your cluster

To create a Grid Engine cluster with preemptible VMs, follow the instructions provided in Create a Grid Engine cluster on Compute Engine and configure the compute nodes of your cluster to be preemptible.

A gridengine cluster is composed of one frontend node, and multiple compute nodes. The frontend node should NOT be preemptible. Only the compute nodes should be.

If your cluster is named gridengine then after you Create your cluster definition file, configure the compute nodes to be preemptible by adding the following to ~/.elasticluster/config:

[cluster/gridengine/compute]
scheduling=preemptible

Monitoring your cluster

As the cluster runs, Compute Engine instances will be automatically terminated independently some time within 24 hours of being created. You typically will want the following to happen:

  1. TERMINATED instances get removed from the cluster
  2. New instances are created to replace TERMINATED instances

This is what the cluster_monitor.sh script does.

The cluster monitoring script is available in the grid-computing-tools github project.

Downloading the grid-computing-tools

To clone the grid-computing-tools github project, issue the following:

git clone https://github.com/googlegenomics/grid-computing-tools.git

It is recommended that you clone the grid-computing-tools project into a sibling directory of your elasticluster directory.

Running cluster_monitor.sh

Be sure that the elasticluster executable is in your PATH. You can do this by setting the PATH explicitly or by activating the elasticluster virtualenv in your shell:

source elasticluster/bin/active

Then run the cluster monitor script:

grid-computing-tools/bin/cluster_monitor.sh gridengine

The script will run continuously; to terminate the script, hit Ctrl-C.

By default, the monitor will check the cluster status and then sleep for 10 minutes. To change the sleep interval, you can pass an additional argument on the command line, for example:

grid-computing-tools/bin/cluster_monitor.sh gridengine 5

would sleep for 5 minutes between checks.

To grow your cluster

To increase the number of workers in your cluster while it is running, update the compute_nodes value in ~/.elasticluster/config. For example, to increase the number of compute nodes from the 3 specified in the Create a Grid Engine cluster on Compute Engine instructions to 10, set:

[cluster/gridengine]
...
compute_nodes=10
...

The next time the cluster monitor wakes up, it will add nodes to the cluster to reach the new value.

To shrink your cluster

To reduce the number of workers in your cluster while it is running, update the compute_nodes value in ~/.elasticluster/config.

As the preemptible VMs are terminated, the cluster monitor will remove them from the cluster, and will only replace instances if the total number in the cluster is less than the configured value. You can also manually terminate instances if desired.

Monitoring your job

When nodes are TERMINATED, any tasks running on those nodes need to be restarted. If the TERMINATED node is re-added by the cluster monitor, and the task is NOT submitted for restart, then the new node may sit idle (if the new node has the same name as the TERMINATED node).

Independent of node terminations, tasks can also stall due to programming bugs or unexpected resource contention. Failing to restart stalled tasks results in a node effectively sitting idle.

To detect tasks that need to be restarted, either due to a TERMINATED node or a stalled task, you can use the array_job_monitor.sh script in the grid-computing-tools github project, which will:

  • For each task allocated to a node:
    • Get the associated node’s uptime
      • Restart the task if
        • the node is down
        • the node’s uptime is less than the task’s running time (meaning that the node has been replaced since the task started)
        • the task runtime is longer than a configurable timeout interval (optional)

Note: when you launch your job on the Grid Engine cluster, be sure to mark the job as “restartable”. This can be done by passing the flag -r y to the qsub command.

Upload the job monitor script

The job monitor script must be run on the cluster’s frontend node. To upload array_job_monitor.sh:

elasticluster sftp gridengine << EOF
mkdir tools
put tools/array_job_monitor tools/
EOF

Run the job monitor script

To run the array_job_monitor.sh, ssh to the frontend instance:

elasticluster ssh gridengine

Parameters for array_job_monitor.sh are:

job_id
Grid Engine job ID to monitor
monitor_interval

Minutes to sleep between checks of running tasks

Default: 15 minutes

task_timeout

Number of minutes a task may run before it is considered stalled, and is eligible to be resubmitted.

Default: None

queue_name

Grid Engine job queue the job_id is associated with

Default: all.q

For example, to monitor job 1, every 5 minutes, for jobs that should not take more than 10 minutes:

./tools/array_job_monitor.sh 1 5 10

Have feedback or corrections? All improvements to these docs are welcome! You can click on the “Edit on GitHub” link at the top right corner of this page or file an issue.

Need more help? Please see https://cloud.google.com/genomics/support.