Run workflows and common tasks in parallel


Do you have a task that you need to run independently over dozens, hundreds, or thousands of files in Google Cloud Storage? The Google Genomics Pipelines API provides an easy way to launch and monitor tasks running in the cloud.

Alpha
This is an Alpha release of the Google Genomics API. This feature might change in backward-incompatible ways and is not recommended for production use. It is not subject to any SLA or deprecation policy.

Overview

A “pipeline” in its simplest form is a task consisting of:

  • Path(s) of input files to read from Cloud Storage
  • Path(s) of output files/directories to write to Cloud Storage
  • A Docker image to run
  • A command to run in the Docker image
  • Cloud resources to use (number of CPUs, amount of memory, disk size and type)
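The elements above map onto the JSON body of a `pipelines.run` request. The following is a minimal sketch assuming the v1alpha2-style request structure (`ephemeralPipeline`/`pipelineArgs`); the project ID, bucket paths, image name, and command are placeholders, not real resources.

```python
# Sketch of a pipelines.run request body (v1alpha2-style field names assumed).
# All identifiers below (project, bucket, image) are hypothetical placeholders.
pipeline_request = {
    "ephemeralPipeline": {
        "projectId": "my-project",            # placeholder project
        "name": "samtools-index",             # hypothetical pipeline name
        "docker": {
            # The Docker image to run, and the command to run inside it.
            "imageName": "gcr.io/my-project/samtools",
            "cmd": "samtools index /mnt/data/input.bam /mnt/data/output.bai",
        },
        # Inputs are copied from Cloud Storage onto the VM's disk before the
        # command runs; outputs are copied back to Cloud Storage afterward.
        "inputParameters": [
            {"name": "inputFile",
             "localCopy": {"path": "input.bam", "disk": "data"}},
        ],
        "outputParameters": [
            {"name": "outputFile",
             "localCopy": {"path": "output.bai", "disk": "data"}},
        ],
        # Cloud resources: CPUs, memory, and disk size/type.
        "resources": {
            "minimumCpuCores": 1,
            "minimumRamGb": 3.75,
            "disks": [{"name": "data", "sizeGb": 100,
                       "type": "PERSISTENT_HDD", "mountPoint": "/mnt/data"}],
        },
    },
    "pipelineArgs": {
        "projectId": "my-project",
        # Per-run values for the declared input/output parameters.
        "inputs": {"inputFile": "gs://my-bucket/input.bam"},
        "outputs": {"outputFile": "gs://my-bucket/output.bai"},
        "logging": {"gcsPath": "gs://my-bucket/logs"},
    },
}
```

Separating the pipeline definition (`ephemeralPipeline`) from the per-run arguments (`pipelineArgs`) is what lets the same definition be launched over many files by varying only the inputs and outputs.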

The Pipelines API will:

  1. Create a Compute Engine virtual machine
  2. Download the Docker image
  3. Download the input files
  4. Run a new Docker container with the specified image and command
  5. Upload the output files
  6. Destroy the Compute Engine virtual machine

Log files are uploaded periodically to Cloud Storage.

Alternatives

In many cases, the Pipelines API has an advantage over fixed clusters: Compute Engine resources (virtual machines and disks) are allocated only for the lifetime of the running pipeline, and are then destroyed.

However, many existing scripts assume a fixed cluster (such as a shared disk). If you want to create a fixed cluster, see Create a Grid Engine cluster on Compute Engine.

Getting started examples

We maintain a GitHub repository, pipelines-api-examples, with several examples to help you get started.

See the README at the top of the repository for prerequisites and a list of the existing examples.

Beyond Files

Note that the Pipelines API is not only for working with files. If you have tools that access data in Google Genomics, Google BigQuery, or any other Google Cloud API, they can be run using the Pipelines API.

When running a pipeline, simply include the appropriate OAuth 2.0 scopes for the Compute Engine service account.
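As a sketch, the scopes would be supplied through the run request's service-account settings. The fragment below assumes a v1alpha2-style `pipelineArgs.serviceAccount` field; the project ID is a placeholder, and the scope strings are the standard Google OAuth 2.0 scope URLs.

```python
# Hypothetical pipelineArgs fragment granting a BigQuery scope (plus Cloud
# Storage access) to the VM's service account, assuming v1alpha2 field names.
pipeline_args = {
    "projectId": "my-project",  # placeholder project
    "serviceAccount": {
        "email": "default",  # use the Compute Engine default service account
        "scopes": [
            # Lets a tool running in the container query BigQuery.
            "https://www.googleapis.com/auth/bigquery",
            # Needed to read inputs from / write outputs to Cloud Storage.
            "https://www.googleapis.com/auth/devstorage.read_write",
        ],
    },
}
```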


Have feedback or corrections? All improvements to these docs are welcome! You can click on the “Edit on GitHub” link at the top right corner of this page or file an issue.

Need more help? Please see https://cloud.google.com/genomics/support.