Run Picard and GATK tools on Cloud-Resident Genomic Data

The properly rendered version of this document can be found at Read The Docs.

If you are reading this on github, you should instead click here.

Introduction

Picard/GATK tools are command line utilities for genomic sequencing data processing that typically take BAM and other files as input and produce modified BAM files.

These tools are frequently chained together into pipelines to perform step-by-step processing of the sequencing data all the way from unaligned sequencer output to variant calls (e.g. see Broad best practices).


We are teaching these tools to take cloud based datasets as a possible input. The foundation for cloud data access is now in HTSJDK library and we have converted a number of Picard tools.

If your dataset is loaded into a cloud provider supporting GA4GH API (e.g. Google Genomics) or you use one of the available datasets from Discover Published Data, you will be able to run a Picard tool against it, reading data directly from the cloud.


New: see the video version of this tutorial.

Below is a step by step guide on how to build Picard tools with GA4GH support, set-up access to genomics data in the cloud and run the tools.

By the end of this tutorial you will be able run a Picard tool, giving it a URL identifying a genomic dataset in the cloud and see the output of processing the data directly from the cloud.

We also have detailed description of changes to HTSJDK and Picard to help you write your own cloud-aware client on top of HTSDK or help us convert more Picard tools.

Set up access to genomics data in Google Cloud

We will assume you are starting from a completely blank slate so please skip the steps that are redundant for you.

If you are already using Google Genomics API and have a project set up for this you can skip this section and go directly to Build Picard tools with GA4GH support.

These instructions are based on Genomics Tools tutorial.

Set up your account and a cloud project

If you don’t already have one, sign up for a Google Account

Sign up for Google Cloud Platform by clicking on this link: https://console.cloud.google.com/billing/freetrial

If you do not yet have a cloud project, create a Genomics and Cloud Storage enabled project via the Google Cloud Platform Console.

Install gcloud tool and validate access to genomics data

If you have not done so before, install gcloud tool: Show/Hide Instructions

Follow the Windows, Mac OS X or Linux instructions to install gcloud on your local machine: https://cloud.google.com/sdk/

  • Download and install the Google Cloud SDK by running this command in your shell or Terminal:
curl https://sdk.cloud.google.com | bash

Or, you can download google-cloud-sdk.zip or google-cloud-sdk.tar.gz, unpack it, and launch the ./google-cloud-sdk/install.sh script.

Restart your shell or Terminal.

  • Authenticate:
$ gcloud auth login
  • Configure the project:
$ gcloud config set project <YOUR_PROJECT_ID>
  • Install the Genomics tools:
$ gcloud components update alpha
  • Confirm the access to Genomics data works:
$ gcloud alpha genomics readgroupsets list 10473108253681171589 --limit 10
ID                      NAME     REFERENCE_SET_ID
CMvnhpKTFhDq9e2Yy9G-Bg  HG02573  EOSt9JOVhp3jkwE
CMvnhpKTFhCEmf_d_o_JCQ  HG03894  EOSt9JOVhp3jkwE
...

Set up credentials for programs accessing the genomics data

If you do not have it already, get your client_secrets.json file: Show/Hide Instructions

Get your client_secrets.json file by visiting the following page:

After you select your Google Cloud project, this link will automatically take you to the Credentials tab under the API Manager.

  1. Select New credentials
  2. Select OAuth client ID

If prompted, select Configure consent screen, and follow the instructions to set a “product name” to identify your Cloud project in the consent screen. Choose “Save”.

  1. Under Application Type choose Other
  2. Give your client ID a name, so you can remember why it was created (suggestion: Genomics Tools)
  3. Select Create

After successful creation, the interface should display your client ID and client secret.

To download the client_secrets.json file:

  1. Select OK to close the dialog
  2. Select the name of your new client id (which you specified in step 4)
  3. Select Download JSON

Note that by convention the downloaded file is referred to as client_secrets.json though the file name is something much longer.

Copy client_secrets.json to the directory where you installed the Genomics tools.

The first time you query the API you will be authenticated using the values in the client_secrets file you downloaded. After this initial authentication, the Genomics tools save a token to use during subsequent API requests.

Build Picard tools with GA4GH support

You will need Maven and Ant build tools installed on your machine.

  1. Fetch Picard, HTSJDK and gatk-tools-java projects required for building Picard with GA4GH support.
$ git clone https://github.com/broadinstitute/picard.git
$ cd picard
$ git clone https://github.com/samtools/htsjdk
$ cd ..
$ git clone https://github.com/googlegenomics/gatk-tools-java
  1. Build gatk-tools-java and copy the resulting JAR into Picard library folder:
$ cd gatk-tools-java
$ mvn compile package
$ mkdir ../picard/lib/gatk-tools-java
$ cp target/gatk-tools-java*minimized.jar ../picard/lib/gatk-tools-java/
  1. Build Picard version with GA4GH support:
// Assuming you are still in gatk-tools-java directory
$ cd ../picard
$ ant -lib lib/ant package-commands-ga4gh

4. Make sure you put client_secrets.json file in the parent folder just above Picard. You should end up with the following directory structure:

your_client_folder \
  client_secrets.json
  gatk-tools-java \
  picard \
    htsjdk \

Run Picard tools with an input from the cloud

You can now run ViewSam tool that prints the contents of the supplied INPUT

$ cd picard
$ java \
   -jar dist/picard.jar ViewSam \
   INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
   GA4GH_CLIENT_SECRETS=../client_secrets.json

This command uses an older, slower REST based API. To run using GRPC API implementation (which is much faster) use the following command that utilizes ALPN jars that come with gatk-tools-java and enables GRPC support:

java \
 -Xbootclasspath/p:../gatk-tools-java/lib/alpn-boot-8.1.3.v20150130.jar \
 -Dga4gh.using_grpc=true \
 -jar dist/picard.jar ViewSam \
 INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
 GA4GH_CLIENT_SECRETS=../client_secrets.json

For Java 7 (as opposed to 8) use alpn-boot-7.1.3.v20150130.jar.

We use a test readset here from genomics-test-data project.

Specifying a genomics region to use from the readset

The INPUT urls are of the form https://<GA4GH provider>/readgroupsets/<readgroupset id>[/sequence][/start-end].

For example:

java -jar dist/picard.jar ViewSam \
INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets\
/CMvnhpKTFhD3he72j4KZuyc/chr17/41196311-41207499 \
GA4GH_CLIENT_SECRETS=../client_secrets.json

Timing the reading speed from the cloud

You can run gatk-tools-java/src/main/scripts/example.sh with and without “grpc” command line parameter to see the difference in reading speed. The timing statistics are dumped to the terminal. We benchmarked x11 speed improvements with GRPC compared to REST, giving ~12,000 reads/second.

The tests were done on Platinum Genomes NA12877_S1.bam dataset, please see the detailed writeup of the test procedure and results if you want to repeat the test.

We therefore recommend running GRPC variants of command line.

Other Picard tools you can run

You can run MarkDuplicates or MarkDuplicatesWithMateCigar tools like this:

java \
  -Xbootclasspath/p:../gatk-tools-java/lib/alpn-boot-8.1.3.v20150130.jar \
  -Dga4gh.using_grpc=true \
  -jar dist/picard.jar MarkDuplicates \
  INPUT=https://www.googleapis.com/genomics/v1beta2/readgroupsets/CK256frpGBD44IWHwLP22R4/ \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  GA4GH_CLIENT_SECRETS=../client_secrets.json

Figuring out a url for your dataset

In the examples above we have been using urls of the form https://www.googleapis.com/genomics/v1beta2/readgroupsets/XXX where XXX is the id of the readset.

How do you find an ID of the readset from the Discover Published Data set or from your own project ?

We will do it step by step using the command line API client.

$ gcloud alpha genomics readgroupsets list 10473108253681171589 --limit 10
ID                      NAME     REFERENCE_SET_ID
CMvnhpKTFhDq9e2Yy9G-Bg  HG02573  EOSt9JOVhp3jkwE
CMvnhpKTFhCEmf_d_o_JCQ  HG03894  EOSt9JOVhp3jkwE
...
  • Note the NAME column - it will correspond to the file name. The ID column is the ID of the readgroupset we are looking for.

Now lets suppose we are not looking for one of the readgroupsets form the genomics public data but instead want to use one from our own project. In this case we need to figure out the dataset id for our files first, before we can use “readgroupsets list” command to list the individual readgroupsets.

  • Lets say we want to figure out which dataset ids are present under genomics test data project.
  • First we need to set the project id for subsequent commands to be our project using
$ gcloud config set project genomics-test-data
  • Now we can issue this command:
$ gcloud alpha genomics datasets list --limit 10
  • The output will list dataset(s) present in the project together with their ids and we can then use the “readgroupsets list” command to get the id of the readgroupset under one of the datasets.

Have feedback or corrections? All improvements to these docs are welcome! You can click on the “Edit on GitHub” link at the top right corner of this page or file an issue.

Need more help? Please see https://cloud.google.com/genomics/support.