The properly rendered version of this document can be found at Read The Docs.
Do you have a task that you need to run independently over dozens, hundreds, or thousands of files in Google Cloud Storage? The Google Genomics Pipelines API provides an easy way to launch and monitor tasks running in the cloud.
Alpha: This is an Alpha release of the Google Genomics API. This feature may change in backward-incompatible ways and is not recommended for production use. It is not subject to any SLA or deprecation policy.
A “pipeline” in its simplest form is a task consisting of:
- Path(s) of input files to read from Cloud Storage
- Path(s) of output files/directories to write to Cloud Storage
- A Docker image to run
- A command to run in the Docker image
- Cloud resources to use (number of CPUs, amount of memory, disk size and type)
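Concretely, those five pieces are bundled into a single run request. The sketch below builds a request body shaped like the v1alpha2 `pipelines.run` message; the project ID, bucket paths, image, command, and resource values are all placeholders, and the exact field names should be checked against the API reference.

```python
# Sketch of a Pipelines API (v1alpha2) run request body.
# All IDs, paths, and values below are illustrative placeholders.
pipeline_request = {
    "ephemeralPipeline": {
        "projectId": "my-project",        # placeholder Cloud project ID
        "name": "count-lines",            # a name for this pipeline
        "docker": {
            "imageName": "ubuntu:14.04",  # Docker image to run
            # Command to run inside the container.
            "cmd": "wc -l /mnt/data/input.txt > /mnt/data/output.txt",
        },
        "resources": {
            "minimumCpuCores": 1,         # number of CPUs
            "minimumRamGb": 3.75,         # amount of memory
            "disks": [{
                "name": "data",
                "sizeGb": 100,            # disk size
                "type": "PERSISTENT_HDD", # disk type
                "mountPoint": "/mnt/data",
            }],
        },
    },
    "pipelineArgs": {
        "projectId": "my-project",
        # Cloud Storage paths for inputs, outputs, and logs.
        "inputs": {"inputFile": "gs://my-bucket/input.txt"},
        "outputs": {"outputFile": "gs://my-bucket/output.txt"},
        "logging": {"gcsPath": "gs://my-bucket/logs"},
    },
}
```

A body like this would be submitted via the API client or the `gcloud` tool; the structure above simply mirrors the list of required pieces.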
The Pipelines API will:
- Create a Compute Engine virtual machine
- Download the Docker image
- Download the input files
- Run a new Docker container with the specified image and command
- Upload the output files
- Destroy the Compute Engine virtual machine
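Submitting a pipeline returns a long-running operation, and a client typically polls it until the lifecycle above completes. The helper below is a minimal sketch of that polling pattern; `fetch_status` is a stand-in for whatever fetches the operation status (e.g. a wrapper around an `operations.get` call), not an actual client method.

```python
import time

def wait_for_operation(fetch_status, poll_interval=1.0, timeout=3600):
    """Poll fetch_status() until the operation reports done.

    fetch_status is any callable returning a dict such as
    {"done": bool} with an optional "error" key -- for example,
    a wrapper around the Pipelines API operations.get call.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = fetch_status()
        if status.get("done"):
            if "error" in status:
                raise RuntimeError("Pipeline failed: %s" % status["error"])
            return status
        time.sleep(poll_interval)
    raise TimeoutError("Operation did not finish within %s seconds" % timeout)

# Example with a stubbed status fetcher that finishes on the third poll.
responses = iter([{"done": False}, {"done": False}, {"done": True}])
result = wait_for_operation(lambda: next(responses), poll_interval=0.01)
```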
Log files are uploaded periodically to Cloud Storage.
In many cases, the Pipelines API has an advantage over a fixed cluster: Compute Engine resources (virtual machines and disks) are allocated only for the lifetime of the running pipeline and are then destroyed.
However, many existing scripts assume fixed-cluster resources (such as a shared disk). If you want to create a fixed cluster, see Create a Grid Engine cluster on Compute Engine.
Getting started examples
We have a GitHub repository with several pipelines-api-examples to help you get started. See the README at the top of the repository for prerequisites and the list of existing examples.
Note that the Pipelines API is not only for working with files. If you have tools that access data in Google Genomics, Google BigQuery, or any other Google Cloud API, they can be run using the Pipelines API.