Compute Principal Coordinate Analysis


Principal Coordinate Analysis counts the number of variants two samples have in common. These counts are then placed into an NxN matrix where N is the number of samples in the variant set. The matrix is centered, scaled, and then the first two principal components are computed for each individual.
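The centering-and-projection step described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the pipeline's actual implementation: it assumes the classic double-centering used in principal coordinate analysis and scales by the number of samples.

```python
# Sketch of the PCA step: given an N x N matrix of shared-variant counts,
# double-center it, scale it, and project each sample onto the first two
# principal components. Illustrative only; the pipeline code may differ.
import numpy as np

def principal_coordinates(similarity, k=2):
    n = similarity.shape[0]
    # Double-centering: subtract row and column means, add back the grand mean.
    centered = (similarity
                - similarity.mean(axis=0)
                - similarity.mean(axis=1)[:, None]
                + similarity.mean())
    centered /= n  # scale by the number of samples (one possible convention)
    eigvals, eigvecs = np.linalg.eigh(centered)
    order = np.argsort(eigvals)[::-1][:k]  # largest eigenvalues first
    return eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))

# Example: three samples with hypothetical pairwise shared-variant counts.
sim = np.array([[10.0, 8.0, 2.0],
                [8.0, 9.0, 3.0],
                [2.0, 3.0, 7.0]])
coords = principal_coordinates(sim)
print(coords.shape)  # (3, 2): two coordinates per sample
```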

See the Data Analysis using Google Genomics codelab for an example that makes use of the results of this analysis run upon 1,000 Genomes.

Both Google Cloud Dataflow and Apache Spark implementations are available.

Dataflow

Setup

To launch the job from your local machine:

Most users launch Dataflow jobs from their local machine. This is unrelated to where the job itself actually runs (which is controlled by the --runner parameter). Either way, Java 8 is needed to run the Jar that kicks off the job.

  1. If you have not already done so, follow the Genomics Quickstart.
  2. If you have not already done so, follow the Dataflow Quickstart including installing gcloud and running gcloud init.
To launch the job from Google Cloud Shell:

If you do not have Java on your local machine, the following setup instructions will allow you to launch Dataflow jobs using the Google Cloud Shell:

  1. If you have not already done so, follow the Genomics Quickstart.
  2. If you have not already done so, follow the Dataflow Quickstart.
  3. Use the Cloud Console to activate the Google Cloud Shell.
  4. Run the following commands in the Cloud Shell to install Java 8.
sudo apt-get update
sudo apt-get install --assume-yes openjdk-8-jdk maven
sudo update-alternatives --config java
sudo update-alternatives --config javac

Note

Depending on the pipeline, Cloud Shell may not have sufficient memory to run the pipeline locally (i.e., without the --runner command line flag). If you get the error java.lang.OutOfMemoryError: Java heap space, follow the instructions to run the pipeline using Compute Engine Dataflow workers instead of locally (e.g., use --runner=DataflowPipelineRunner).

If you want to run a small pipeline on your machine before running it in parallel on Compute Engine, you will need ALPN, since many of these pipelines require it. When running locally, ALPN must be provided on the boot classpath; on Compute Engine Dataflow workers it is already configured for you. You can download it from here. For example:

wget -O alpn-boot.jar \
  http://central.maven.org/maven2/org/mortbay/jetty/alpn/alpn-boot/8.1.8.v20160420/alpn-boot-8.1.8.v20160420.jar

Download the latest GoogleGenomics dataflow runnable jar from the Maven Central Repository. For example:

wget -O google-genomics-dataflow-runnable.jar \
  https://search.maven.org/remotecontent?filepath=com/google/cloud/genomics/google-genomics-dataflow/v1-0.1/google-genomics-dataflow-v1-0.1-runnable.jar

Run the pipeline

The following command will run PCA over the BRCA1 region within the Illumina Platinum Genomes variant set.

java -Xbootclasspath/p:alpn-boot.jar \
  -cp google-genomics-dataflow-runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity \
  --variantSetId=3049512673186936334 \
  --references=chr17:41196311:41277499 \
  --output=gs://YOUR-BUCKET/dataflow-output/platinum-genomes-brca1-pca.tsv

The above command line runs the pipeline locally over a small portion of the genome, taking only a few minutes. If modified to run over a larger portion of the genome or the entire genome, it may take a few hours, depending upon how many virtual machines are configured to run concurrently via --numWorkers. To run the pipeline on Google Cloud instead of locally, add the following command line parameters:

--runner=DataflowPipelineRunner \
--project=YOUR-GOOGLE-CLOUD-PLATFORM-PROJECT-ID \
--stagingLocation=gs://YOUR-BUCKET/dataflow-staging \
--numWorkers=#

Use a comma-separated list to run over multiple disjoint regions. For example, to run over both BRCA1 and BRCA2, use --references=chr13:32889610:32973808,chr17:41196311:41277499.

To run this pipeline over the entire genome, use --allReferences instead of --references=chr17:41196311:41277499.

To run the pipeline on a different variant set:

  • Pass the id of the new variant set via the --variantSetId parameter.
  • Update the --references as appropriate (e.g., add/remove the ‘chr’ prefix on reference names).

Additional details

If the Application Default Credentials are not sufficient, use --client-secrets PATH/TO/YOUR/client_secrets.json. If you do not already have this file, see the authentication instructions to obtain it.

Use --help to get more information about the command line options. Change the pipeline class name below to match the one you would like to run.

java -cp google-genomics-dataflow*runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity --help

See the source code for implementation details: https://github.com/googlegenomics/dataflow-java

Spark

Setup

  • Deploy your Spark cluster using Google Cloud Dataproc. This can be done using the Cloud Platform Console or the following gcloud command:

    gcloud beta dataproc clusters create example-cluster --scopes cloud-platform
    
  • ssh to the master.

    gcloud compute ssh example-cluster-m
    
  • Compile and build the pipeline jar. You can build locally or build on the Spark master Google Compute Engine virtual machine.

To compile and build on Compute Engine:
  1. Install sbt.
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get install apt-transport-https
sudo apt-get update
sudo apt-get install sbt
  2. Clone the GitHub repository.
sudo apt-get install git
git clone https://github.com/googlegenomics/spark-examples.git
  3. Compile the Jar.
cd spark-examples
sbt assembly
cp target/scala-2.*/googlegenomics-spark-examples-assembly-*.jar ~/
cd ~/

Run the job

The following command will run PCA over the BRCA1 region within the Illumina Platinum Genomes variant set.

spark-submit \
  --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
  --conf spark.shuffle.spill=true \
  googlegenomics-spark-examples-assembly-1.0.jar \
  --variant-set-id 3049512673186936334 \
  --references chr17:41196311:41277499 \
  --output-path gs://YOUR-BUCKET/output/platinum-genomes-brca1-pca.tsv

The above command line runs the job over a small portion of the genome, taking only a couple of minutes. If modified to run over a larger portion of the genome or the entire genome, it may take a few hours, depending upon how many machines are in the Spark cluster.

To run this job over a large portion of the genome or the entire genome:

  • Create a larger cluster: gcloud beta dataproc clusters create cluster-2 --scopes cloud-platform --num-workers #
  • Add --num-reduce-partitions #, setting it equal to the number of cores in your cluster.
  • Use a comma-separated list to run over multiple disjoint regions. For example, to run over both BRCA1 and BRCA2, use --references chr13:32889610:32973808,chr17:41196311:41277499.
  • Use --all-references instead of --references chr17:41196311:41277499 to run over the entire genome.

To run the job on a different variant set:

  • Pass the id of the new variant set via the --variant-set-id parameter.
  • Update the --references as appropriate (e.g., add/remove the ‘chr’ prefix on reference names).

Additional details

If the Application Default Credentials are not sufficient, use --client-secrets=PATH/TO/YOUR/client_secrets.json. If you do not already have this file, see the authentication instructions to obtain it.

Use --help to get more information about the job-specific command line options. Change the job class name below to match the one you would like to run.

spark-submit --class com.google.cloud.genomics.spark.examples.VariantsPcaDriver \
  googlegenomics-spark-examples-assembly-1.0.jar  --help

See the source code for implementation details: https://github.com/googlegenomics/spark-examples

Gather the results into a single file

gsutil cat gs://YOUR-BUCKET/output/platinum-genomes-brca1-pca.tsv* \
  | sort > platinum-genomes-brca1-pca.tsv
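Once gathered into a single file, the per-sample coordinates can be inspected programmatically. A minimal sketch, assuming each output row is a tab-separated sample name followed by its first two principal components (verify this layout against your actual output file):

```python
# Parse the gathered PCA output (assumed columns: sample, PC1, PC2).
# The inline string below is a tiny stand-in for the real TSV file;
# the sample names and values are hypothetical.
import csv
import io

example_tsv = "NA12877\t-0.021\t0.031\nNA12878\t0.014\t-0.012\n"

rows = csv.reader(io.StringIO(example_tsv), delimiter="\t")
coords = {sample: (float(pc1), float(pc2)) for sample, pc1, pc2 in rows}
print(coords["NA12877"])  # prints (-0.021, 0.031)
```

To read the real file, replace the inline string with open("platinum-genomes-brca1-pca.tsv").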

Have feedback or corrections? All improvements to these docs are welcome! Please file an issue on GitHub.

Need more help? Please see https://cloud.google.com/genomics/support.