Compute Identity By State
Identity-by-State is a simple similarity measure that describes the alleles shared by two individuals as a single number.
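As a rough illustration of the idea only: one common textbook formulation counts, at each site, how many alleles two diploid genotypes share (0, 1, or 2) and averages that count over all sites. This is not necessarily what the pipeline computes, since it offers several calculators. The sketch below assumes a hypothetical two-column input of unphased genotypes such as `0/1`; the function name `ibs_mean` and the input layout are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mean fraction of alleles shared by two individuals.
# Input (stdin): one site per line, two whitespace-separated diploid
# genotypes such as "0/1 1/1". This layout is assumed for illustration.
ibs_mean() {
  awk '
    {
      split($1, a, "/"); split($2, b, "/")
      shared = 0
      # Count the multiset intersection of the two allele pairs.
      if (a[1] == b[1])      { shared = 1; if (a[2] == b[2]) shared = 2 }
      else if (a[1] == b[2]) { shared = 1; if (a[2] == b[1]) shared = 2 }
      else if (a[2] == b[1] || a[2] == b[2]) { shared = 1 }
      total += shared
      sites++
    }
    END { if (sites > 0) printf "%.3f\n", total / (2 * sites) }
  '
}

# Four sites sharing 2, 1, 0 and 2 alleles: 5 of 8 possible, so 0.625.
printf '0/1 0/1\n0/0 0/1\n0/0 1/1\n0/1 1/0\n' | ibs_mean
```

Identical genotypes at every site give 1.000 and genotypes with no alleles in common give 0.000.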
See the Quality Control using Google Genomics codelab for an example that makes use of the results of this analysis run upon Illumina Platinum Genomes.
A Google Cloud Dataflow implementation is available.
Setup Dataflow
Most users launch Dataflow jobs from their local machine. This is unrelated to where the job itself actually runs (which is controlled by the --runner parameter). Either way, Java 8 is needed to run the JAR that kicks off the job.
- If you have not already done so, follow the Genomics Quickstart.
- If you have not already done so, follow the Dataflow Quickstart, including installing gcloud and running gcloud init.
If you do not have Java on your local machine, the following setup instructions will allow you to launch Dataflow jobs using the Google Cloud Shell:
- If you have not already done so, follow the Genomics Quickstart.
- If you have not already done so, follow the Dataflow Quickstart.
- Use the Cloud Console to activate the Google Cloud Shell.
- Run the following commands in the Cloud Shell to install Java 8.
sudo apt-get update
sudo apt-get install --assume-yes openjdk-8-jdk maven
sudo update-alternatives --config java
sudo update-alternatives --config javac
Note
Depending on the pipeline, Cloud Shell may not have sufficient memory to run the pipeline locally (e.g., without the --runner command line flag). If you get the error java.lang.OutOfMemoryError: Java heap space, follow the instructions to run the pipeline using Compute Engine Dataflow workers instead of locally (e.g., use --runner=DataflowPipelineRunner).
If you want to run a small pipeline on your machine before running it in parallel on Compute Engine, you will need ALPN since many of these pipelines require it. When running locally, this must be provided on the boot classpath but when running on Compute Engine Dataflow workers this is already configured for you. You can download it from here. For example:
wget -O alpn-boot.jar \
http://central.maven.org/maven2/org/mortbay/jetty/alpn/alpn-boot/8.1.8.v20160420/alpn-boot-8.1.8.v20160420.jar
Download the latest GoogleGenomics dataflow runnable jar from the Maven Central Repository. For example:
wget -O google-genomics-dataflow-runnable.jar \
https://search.maven.org/remotecontent?filepath=com/google/cloud/genomics/google-genomics-dataflow/v1-0.1/google-genomics-dataflow-v1-0.1-runnable.jar
Run the pipeline
The following command will run Identity-by-State over the BRCA1 region within the Illumina Platinum Genomes variant set.
java -Xbootclasspath/p:alpn-boot.jar \
-cp google-genomics-dataflow-runnable.jar \
com.google.cloud.genomics.dataflow.pipelines.IdentityByState \
--variantSetId=3049512673186936334 \
--references=chr17:41196311:41277499 \
--hasNonVariantSegments \
--output=gs://YOUR-BUCKET/dataflow-output/platinum-genomes-brca1-ibs.tsv
Note that there are several IBS calculators from which to choose. Use the --callSimilarityCalculatorFactory parameter to switch between them.
Also notice the use of the --hasNonVariantSegments parameter when running this pipeline on the Illumina Platinum Genomes variant set.
- For data with non-variant segments (such as Complete Genomics data or data in Genome VCF (gVCF) format), specify this flag so that the pipeline correctly takes into account non-variant segment records that overlap variants within the variant set.
- The source Illumina Platinum Genomes data imported into Google Genomics was in gVCF format.
The above command line runs the pipeline locally over a small portion of the genome, taking only a few minutes. If modified to run over a larger portion of the genome or the entire genome, it may take a few hours depending upon how many virtual machines are configured to run concurrently via --numWorkers. Add the following additional command line parameters to run the pipeline on Google Cloud instead of locally:
--runner=DataflowPipelineRunner \
--project=YOUR-GOOGLE-CLOUD-PLATFORM-PROJECT-ID \
--stagingLocation=gs://YOUR-BUCKET/dataflow-staging \
--numWorkers=#
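Put together with the local command, the cloud invocation might look like the sketch below. It only assembles and prints the command so it can be reviewed first. The variable names and the worker count of 10 are arbitrary choices for this example; the placeholders must be replaced with your own project ID and bucket.

```shell
#!/usr/bin/env bash
# Placeholders: replace with your own values before running.
PROJECT_ID="YOUR-GOOGLE-CLOUD-PLATFORM-PROJECT-ID"
BUCKET="gs://YOUR-BUCKET"
NUM_WORKERS=10   # arbitrary example value, not a recommendation

# The local command from above plus the four cloud parameters.
cmd=(
  java -Xbootclasspath/p:alpn-boot.jar
  -cp google-genomics-dataflow-runnable.jar
  com.google.cloud.genomics.dataflow.pipelines.IdentityByState
  --variantSetId=3049512673186936334
  --references=chr17:41196311:41277499
  --hasNonVariantSegments
  "--output=${BUCKET}/dataflow-output/platinum-genomes-brca1-ibs.tsv"
  --runner=DataflowPipelineRunner
  "--project=${PROJECT_ID}"
  "--stagingLocation=${BUCKET}/dataflow-staging"
  "--numWorkers=${NUM_WORKERS}"
)

# Print the command for review; to launch it, run: "${cmd[@]}"
printf '%s ' "${cmd[@]}"; echo
```

Keeping the arguments in a bash array avoids line-continuation mistakes and keeps each parameter intact as a single argument.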
Use a comma-separated list to run over multiple disjoint regions. For example, to run over BRCA1 and BRCA2, use --references=chr13:32889610:32973808,chr17:41196311:41277499.
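Because a malformed region list is only reported once the job starts, it can help to check the string beforehand. The following is a minimal sketch of such a check; `valid_references` is a hypothetical helper, and the rule it enforces (each entry is NAME:START:END with numeric START less than END) is an assumption inferred from the examples here, not a documented specification.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every comma-separated entry looks
# like NAME:START:END with numeric START < END (assumed format).
valid_references() {
  local entries entry name start end
  [[ -n "$1" ]] || return 1
  IFS=',' read -r -a entries <<< "$1"
  for entry in "${entries[@]}"; do
    IFS=':' read -r name start end <<< "$entry"
    [[ -n "$name" && "$start" =~ ^[0-9]+$ && "$end" =~ ^[0-9]+$ ]] || return 1
    # Force base 10 so that leading zeros are not read as octal.
    (( 10#$start < 10#$end )) || return 1
  done
  return 0
}

REFS="chr13:32889610:32973808,chr17:41196311:41277499"
if valid_references "$REFS"; then
  echo "--references=${REFS}"
else
  echo "malformed region list: ${REFS}" >&2
fi
```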
To run this pipeline over the entire genome, use --allReferences instead of --references=chr17:41196311:41277499.
To run the pipeline on a different variant set:
- Change the variant set id for the --variantSetId parameter.
- Update the --references as appropriate (e.g., add/remove the ‘chr’ prefix on reference names).
- Remove the --hasNonVariantSegments parameter if it is not applicable.
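Adjusting the ‘chr’ prefix on a region list can be done by hand, or with a small helper like the sketch below. Both function names are hypothetical, and the sketch assumes every entry in the list uses the same naming convention.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for switching reference naming conventions in a
# comma-separated --references value.

# Remove a leading "chr" from every entry.
strip_chr_prefix() {
  local refs=",$1"
  refs="${refs//,chr/,}"
  echo "${refs#,}"
}

# Add a leading "chr" to every entry (assumes none has it already).
add_chr_prefix() {
  local refs=",$1"
  refs="${refs//,/,chr}"
  echo "${refs#,}"
}

strip_chr_prefix "chr13:32889610:32973808,chr17:41196311:41277499"
# prints 13:32889610:32973808,17:41196311:41277499

add_chr_prefix "13:32889610:32973808,17:41196311:41277499"
# prints chr13:32889610:32973808,chr17:41196311:41277499
```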
Gather the results into a single file
gsutil cat gs://YOUR-BUCKET/dataflow-output/platinum-genomes-brca1-ibs.tsv* \
| sort > platinum-genomes-brca1-ibs.tsv
Additional details
If the Application Default Credentials are not sufficient, use --client-secrets PATH/TO/YOUR/client_secrets.json. If you do not already have this file, see the authentication instructions to obtain it.
Use --help to get more information about the command line options. Change the pipeline class name below to match the one you would like to run.
java -cp google-genomics-dataflow*runnable.jar \
com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity --help
See the source code for implementation details: https://github.com/googlegenomics/dataflow-java