Transform Linkage Disequilibrium Results

The properly rendered version of this document can be found at Read The Docs.


These pipelines take linkage disequilibrium (LD) results generated by the Compute Linkage Disequilibrium on a Variant Set pipeline and transform them into Cloud BigQuery and Cloud BigTable datasets that can be analyzed efficiently. Each pipeline takes a set of LD results as input and exports, transforms, and loads them into the appropriate target data store.

The pipelines are implemented on Google Cloud Dataflow.

Load Linkage Disequilibrium Data into Cloud BigQuery

The output of the LinkageDisequilibrium pipeline is a file of comma-separated values. This is a standard file format for BigQuery ingestion. General instructions for loading data into BigQuery are available here. An example script for loading data from a CSV is available at load_data_from_csv.py.

When using that script, the schema for the table of LD results is available in the linkage disequilibrium repository in the ld_bigquery_schema_fields.txt file.

LD data can therefore be loaded into a BigQuery table with the following commands:

PROJECTID=<your-project-id>
DATASETID=<your-bigquery-dataset-id>
TABLE=<your-desired-bigquery-table-name>
DATA=<path-to-linkage-disequilibrium-result-data>

python path/to/load_data_from_csv.py \
  $PROJECTID $DATASETID $TABLE schema/ld_bigquery_schema_fields.txt $DATA
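If you prefer to prepare the load programmatically, the schema fields file can be parsed into (name, type) pairs before handing them to a BigQuery client. The one-field-per-line name:type format below is an assumption for illustration only; consult ld_bigquery_schema_fields.txt in the repository for the actual format.

```python
def parse_schema_fields(text):
    """Parse schema text into (name, type) pairs.

    Assumes a hypothetical 'name:type' one-per-line layout; the real
    ld_bigquery_schema_fields.txt may differ.
    """
    fields = []
    for line in text.strip().splitlines():
        name, field_type = line.split(":")
        fields.append((name.strip(), field_type.strip()))
    return fields

# Hypothetical field names, for illustration only.
example = "query_chrom:STRING\nquery_start:INTEGER\nld_value:FLOAT"
print(parse_schema_fields(example))
```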

Setup Dataflow

To launch the job from your local machine:

Most users launch Dataflow jobs from their local machine. This is unrelated to where the job itself actually runs (which is controlled by the --runner parameter). Either way, Java 8 is needed to run the JAR that kicks off the job.

  1. If you have not already done so, follow the Genomics Quickstart.
  2. If you have not already done so, follow the Dataflow Quickstart including installing gcloud and running gcloud init.
To launch the job from Google Cloud Shell:

If you do not have Java on your local machine, the following setup instructions will allow you to launch Dataflow jobs using the Google Cloud Shell:

  1. If you have not already done so, follow the Genomics Quickstart.
  2. If you have not already done so, follow the Dataflow Quickstart.
  3. Use the Cloud Console to activate the Google Cloud Shell.
  4. Run the following commands in the Cloud Shell to install Java 8.
sudo apt-get update
sudo apt-get install --assume-yes openjdk-8-jdk maven
sudo update-alternatives --config java
sudo update-alternatives --config javac

Note

Depending on the pipeline, Cloud Shell may not have sufficient memory to run the pipeline locally (i.e., without the --runner command line flag). If you get the error java.lang.OutOfMemoryError: Java heap space, follow the instructions to run the pipeline using Compute Engine Dataflow workers instead of locally (i.e., use --runner=DataflowPipelineRunner).

If you want to run a small pipeline on your machine before running it in parallel on Compute Engine, you will need ALPN, since many of these pipelines require it. When running locally, ALPN must be provided on the boot classpath; on Compute Engine Dataflow workers it is already configured for you. You can download it from here. For example:

wget -O alpn-boot.jar \
  http://central.maven.org/maven2/org/mortbay/jetty/alpn/alpn-boot/8.1.8.v20160420/alpn-boot-8.1.8.v20160420.jar

Build the Linkage Disequilibrium jar:

git clone https://github.com/googlegenomics/linkage-disequilibrium.git
cd linkage-disequilibrium
mvn package

Load Linkage Disequilibrium Data into Cloud BigTable

Because BigTable allows efficient access to extremely large datasets indexed by a single key, it is a natural choice for representation of LD data. The WriteLdBigtable pipeline converts data generated by the Compute Linkage Disequilibrium on a Variant Set pipeline and writes the results into a BigTable using Dataflow. The key for each BigTable row is designed so that all LD results for a single query variant appear in a contiguous block of the table, sorted by the location of the target variants, and results for query variants are sorted by the location of query variants. This key design allows efficient access to all LD results for a single variant or a single region of the genome.
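The key design described above can be sketched as follows. The exact encoding used by WriteLdBigtable may differ; the field layout and padding width here are assumptions chosen to illustrate the idea that fixed-width, zero-padded positions make lexicographic byte order match numeric genomic order.

```python
def make_row_key(query_chrom, query_pos, target_chrom, target_pos):
    """Build an illustrative BigTable row key for one LD result.

    Zero-padding positions to a fixed width keeps all results for a
    query variant contiguous and sorted by target variant location.
    """
    return "{}:{:010d}:{}:{:010d}".format(
        query_chrom, query_pos, target_chrom, target_pos)

keys = [
    make_row_key("17", 41196311, "17", 41277499),
    make_row_key("17", 41196311, "17", 41199999),
    make_row_key("17", 41100000, "17", 41196311),
]
# Sorting the keys groups results by query variant, then target position.
print(sorted(keys))
```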

The following command will load LD results into an existing BigTable:

java -Xbootclasspath/p:alpn-boot.jar \
  -cp target/linkage-disequilibrium*runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.WriteLdBigtable \
  --bigtableProjectId=YOUR_BIGTABLE_PROJECT_ID \
  --bigtableClusterId=YOUR_BIGTABLE_CLUSTER_ID \
  --bigtableZoneId=YOUR_BIGTABLE_ZONE_ID \
  --bigtableTableId=YOUR_BIGTABLE_TABLE_ID \
  --ldInput="gs://YOUR-BUCKET/PATH-TO-DIRECTORY-WITH-LD-RESULTS/*"

The above command runs the pipeline locally and takes only a few minutes over a small portion of the genome. If modified to run over a larger portion of the genome, or the entire genome, it may take a few hours, depending on how many virtual machines are configured to run concurrently via --numWorkers. Add the following command line parameters to run the pipeline on Google Cloud instead of locally:

--runner=DataflowPipelineRunner \
--project=YOUR-GOOGLE-CLOUD-PLATFORM-PROJECT-ID \
--stagingLocation=gs://YOUR-BUCKET/dataflow-staging \
--numWorkers=#

Retrieve Linkage Disequilibrium Data from Cloud BigTable

Once a BigTable storing LD data has been created, a mechanism for accessing the results must be created. The QueryLdBigtable pipeline provides an example in which Dataflow is used to read a subset of data from an LD BigTable and write the results to GCS in the same format as it was originally written by the Compute Linkage Disequilibrium on a Variant Set pipeline.

The following command will query LD results for a specific region of the genome and write results to a Cloud bucket:

java -Xbootclasspath/p:alpn-boot.jar \
  -cp target/linkage-disequilibrium*runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.QueryLdBigtable \
  --bigtableProjectId=YOUR_BIGTABLE_PROJECT_ID \
  --bigtableClusterId=YOUR_BIGTABLE_CLUSTER_ID \
  --bigtableZoneId=YOUR_BIGTABLE_ZONE_ID \
  --bigtableTableId=YOUR_BIGTABLE_TABLE_ID \
  --queryRange="17:41196311-41277499" \
  --resultLocation="gs://YOUR-BUCKET/PATH-TO-OUTPUT-FILE"

Use a comma-separated list to run over multiple disjoint regions. For example, to run over BRCA1 and BRCA2, use --references=chr13:32889610:32973808,chr17:41196311:41277499.

To run this pipeline over the entire genome, use --allReferences instead of --references=chr17:41196311:41277499.
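The comma-separated region format above can be split into (reference, start, end) tuples with a few lines of Python. This helper is purely illustrative and is not part of the pipeline:

```python
def parse_regions(spec):
    """Parse 'chr13:32889610:32973808,chr17:41196311:41277499'
    into a list of (reference, start, end) tuples."""
    regions = []
    for region in spec.split(","):
        ref, start, end = region.split(":")
        regions.append((ref, int(start), int(end)))
    return regions

print(parse_regions("chr13:32889610:32973808,chr17:41196311:41277499"))
```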

Additional details

Use --help to get more information about the command line options. Change the pipeline class name below to match the one you would like to run.

java -cp target/linkage-disequilibrium*runnable.jar \
  com.google.cloud.genomics.dataflow.pipelines.LinkageDisequilibrium \
  --help=com.google.cloud.genomics.dataflow.pipelines.LinkageDisequilibrium\$LinkageDisequilibriumOptions

See the source code for implementation details: https://github.com/googlegenomics/linkage-disequilibrium


Have feedback or corrections? All improvements to these docs are welcome! You can click on the “Edit on GitHub” link at the top right corner of this page or file an issue.

Need more help? Please see https://cloud.google.com/genomics/support.