BioC 2015: Where Software and Biology Connect

The properly rendered version of this document can be found at Read The Docs.

This workshop was presented at the annual Bioconductor Developer’s Conference.

Google has some pretty amazing big data computational “hammers” that they have been applying to search and video data for a long time. In this workshop we take those same hammers and apply them to whole genome sequences.

We will work with both the 1,000 Genomes reads and variants and also the Illumina Platinum Genomes gVCF variants.

We do all of this from the comfort of the R prompt using common packages, including VariantAnnotation, ggbio, ggplot2, dplyr, and bigrquery, plus the new Bioconductor package GoogleGenomics, which provides an R interface to Google’s implementation of the Global Alliance for Genomics and Health API.

And we’ll do this reproducibly, by running RMarkdown files in Dockerized Bioconductor on Google Compute Engine VMs!

Get Started with Google Cloud Platform

Create a Google Cloud Platform project

Sign up for Google Cloud Platform by clicking on this link: https://console.cloud.google.com/billing/freetrial

Enable APIs

Enable all the Google Cloud Platform APIs we will use in this workshop by clicking on this link.

Install gcloud

Follow the Windows, Mac OS X or Linux instructions to install gcloud on your local machine: https://cloud.google.com/sdk/

  • Download and install the Google Cloud SDK by running this command in your shell or Terminal:
$ curl https://sdk.cloud.google.com | bash

Or, you can download google-cloud-sdk.zip or google-cloud-sdk.tar.gz, unpack it, and launch the ./google-cloud-sdk/install.sh script.

Restart your shell or Terminal.

  • Authenticate:
$ gcloud auth login
  • Configure the project:
$ gcloud config set project <YOUR_PROJECT_ID>
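
Once the steps above are done, it can help to confirm that the installation, login, and project setting all took effect. The following is a minimal sketch, not part of the original workshop: the function name check_gcloud_setup is hypothetical, and it only wraps the standard gcloud auth list and gcloud config list commands.

```shell
# Sketch: confirm that gcloud is installed, authenticated, and pointed at a project.
# check_gcloud_setup is a hypothetical helper name used only for illustration.
check_gcloud_setup() {
  # Fail early if the SDK is not on PATH (for example, the shell was not restarted).
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found on PATH; restart your shell or re-run the installer" >&2
    return 1
  fi
  # List the accounts that gcloud holds credentials for.
  gcloud auth list
  # Show the project set with 'gcloud config set project'.
  gcloud config list project
}
```

Run check_gcloud_setup after restarting your shell. If the project line comes back empty, repeat the "Configure the project" step.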

Set up Bioconductor

To further the goals of reproducibility, ease of use, and convenience, you can run this codelab in a Bioconductor Docker container deployed to Google Compute Engine. But this codelab can be run from anywhere since all the heavy lifting is happening in the cloud regardless of where R is running.

Bioconductor maintains Docker containers with R, Bioconductor packages, and RStudio Server all ready to go! It's a great way to set up your R environment quickly and start working. The instructions are below, but if you want to learn more, see http://www.bioconductor.org/help/docker/.

  1. Click on click-to-deploy Bioconductor to navigate to the deployer page on the Cloud Platform Console.
  2. In the Docker Image field, choose custom.
  3. Click on More to display the additional form fields.
  4. In the Custom docker image field, paste gcr.io/bioc_2015/devel_sequencing.
  5. Click on the Deploy Bioconductor button.
  6. Follow the post-deployment instructions to log into RStudio Server via your browser!

To run the docker container locally:

  1. Install Docker for your platform.
  2. Run command docker run gcr.io/bioc_2015/devel_sequencing

See https://github.com/googlegenomics/gce-images for the Dockerfile. The image builds on the Bioconductor Docker images (http://www.bioconductor.org/help/docker/), which in turn build on rocker (https://github.com/rocker-org/rocker/wiki).

Note that the image is large (over 4 GB), since it is derived from the Bioconductor Sequencing view and contains many annotation databases.
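
A plain docker run does not publish any ports, so RStudio Server inside the container would not be reachable from your browser. The sketch below is one possible way to start the container locally. It rests on assumptions not stated in this workshop: that the image serves RStudio Server on port 8787, as rocker-based images conventionally do, and the function name run_workshop_container is hypothetical.

```shell
# Sketch only: the port number is an assumption about rocker-based images,
# not something this workshop states.
IMAGE="gcr.io/bioc_2015/devel_sequencing"   # image name given in this workshop
PORT=8787                                   # assumed RStudio Server port

# run_workshop_container is a hypothetical helper name used only for illustration.
run_workshop_container() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH; install Docker first" >&2
    return 1
  fi
  # Pull explicitly so the large (over 4 GB) download is a visible, separate step.
  docker pull "$IMAGE" || return 1
  # Publish the assumed RStudio Server port and remove the container on exit.
  docker run --rm -p "${PORT}:${PORT}" "$IMAGE"
}
```

After calling run_workshop_container, try http://localhost:8787 in your browser. If nothing answers there, check the Dockerfile linked above for the port the image actually exposes.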

If you prefer to set up R manually instead, run the following in R:
# Install BiocInstaller.
source("http://bioconductor.org/biocLite.R")
# See http://www.bioconductor.org/developers/how-to/useDevel/
useDevel()
# Install devtools which is needed for the special use of biocLite() below.
biocLite("devtools")
# Install the workshop material.
biocLite("googlegenomics/bioconductor-workshop-r", build_vignettes=TRUE, dependencies=TRUE)

Run the Codelabs

  1. View the workshop documentation.
help(package="GoogleGenomicsBioc2015Workshop")
  2. Click on “User guides, package vignettes and other documentation.”
  3. Early on in the workshop you will need an API_KEY. You can get this by clicking on this link: https://console.cloud.google.com/project/_/apiui/credential
  4. Click on vignette “Bioc2015Workshop” and follow the instructions there to run the vignettes line-by-line or chunk-by-chunk!
  • To run line-by-line, put your cursor on the desired line and click the “Run” button or use keyboard shortcuts for Windows/Linux: Ctrl+Enter and Mac: Command+Enter.
  • To run chunk-by-chunk, put your cursor in the desired chunk and click the “Chunks -> Run Current Chunk” button, or use the keyboard shortcuts: Ctrl+Alt+C on Windows/Linux and Command+Option+C on Mac.

If you just want to read the rendered results of the four codelabs, here they are:

Wrap up

“Stop” or “Delete” your VM

If you would like to pause your VM when not using it:

  1. Go to the Google Cloud Platform Console and select your project: https://console.cloud.google.com/project/_/compute/instances
  2. Click on the checkbox next to your VM.
  3. Click on Stop to pause your VM.
  4. When you are ready to use it again, Start your VM. For more detail, see: https://cloud.google.com/compute/docs/instances/stopping-or-deleting-an-instance
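
The same stop and start actions are also available from the command line through gcloud compute instances. The following is a minimal sketch, assuming gcloud is installed and configured as described earlier. The function name vm_power is hypothetical, and the instance name and zone are placeholders you must replace with the values shown for your VM in the console.

```shell
# Sketch: stop or start a Compute Engine VM from the shell.
# vm_power is a hypothetical helper; INSTANCE and ZONE come from your own project.
vm_power() {
  action="${1:-}"; instance="${2:-}"; zone="${3:-}"
  # Only the two actions described above are accepted.
  case "$action" in
    stop|start) ;;
    *) echo "usage: vm_power stop|start INSTANCE ZONE" >&2; return 2 ;;
  esac
  if [ -z "$instance" ] || [ -z "$zone" ]; then
    echo "usage: vm_power stop|start INSTANCE ZONE" >&2
    return 2
  fi
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "gcloud not found on PATH" >&2
    return 1
  fi
  gcloud compute instances "$action" "$instance" --zone "$zone"
}
```

For example, vm_power stop my-bioconductor-vm us-central1-a would pause a VM with that (made-up) name, and the same call with start would resume it. Running gcloud compute instances list shows the real names and zones in your project.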

If you want to delete your deployment:

  1. First copy any data off of the data disk that you wish to keep. The data disk will be deleted when the deployment is deleted.
  2. Click on Deployments to navigate to your deployment and delete it.

Stay Involved


Have feedback or corrections? All improvements to these docs are welcome! You can click on the “Edit on GitHub” link at the top right corner of this page or file an issue.

Need more help? Please see https://cloud.google.com/genomics/support.