Calculate Coverage¶
The properly rendered version of this document can be found at Read The Docs. If you are reading this on github, you should instead click here.
This pipeline calculates mean read depth coverage for a given dataset or set of read group sets and writes the results to Annotations in a new AnnotationSet using the Genomics API.
For each “bucket” in the given input references, the pipeline computes the mean read depth coverage (rounded to six decimal places) that 10%, 20%, 30%, and so on of the input read group sets achieve across that bucket. These percentiles are reported separately for each mapping quality class of the reads (<10: Low (L), 10-29: Medium (M), >=30: High (H)), as well as for all reads regardless of mapping quality (All (A)).
The number of quantiles is configurable: for example, numQuantiles = 5 would give you the minimum read group set mean coverage for each and across all mapping qualities, the 25th, 50th, and 75th percentiles, and the maximum of these values.
The pipeline is implemented on Google Cloud Dataflow.
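To make the binning and quantile logic above concrete, here is a minimal, illustrative Java sketch. It is not the pipeline's actual code (the class and method names are hypothetical); see the source repository linked at the end of this page for the real implementation.

import java.util.Arrays;

public class CoverageQuantiles {

  // Mapping quality bins as described above.
  static String qualityBin(int mappingQuality) {
    if (mappingQuality < 10) {
      return "L"; // Low
    } else if (mappingQuality < 30) {
      return "M"; // Medium (10-29)
    }
    return "H"; // High (>= 30)
  }

  // Evenly spaced quantiles (minimum .. maximum) over the sorted
  // per-read-group-set mean coverages; assumes numQuantiles >= 2.
  static double[] quantiles(double[] means, int numQuantiles) {
    double[] sorted = means.clone();
    Arrays.sort(sorted);
    double[] out = new double[numQuantiles];
    for (int i = 0; i < numQuantiles; i++) {
      int idx = (int) Math.round(i * (sorted.length - 1.0) / (numQuantiles - 1));
      out[i] = sorted[idx];
    }
    return out;
  }

  public static void main(String[] args) {
    // With numQuantiles = 3 this yields the minimum, median, and maximum.
    double[] means = {21.004883, 35.042969, 56.575195};
    System.out.println(Arrays.toString(quantiles(means, 3)));
    System.out.println(qualityBin(42)); // H
  }
}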
Setup Dataflow¶
Most users launch Dataflow jobs from their local machine. This is unrelated to where the job itself actually runs (which is controlled by the --runner parameter). Either way, Java 8 is needed to run the Jar that kicks off the job.
- If you have not already done so, follow the Genomics Quickstart.
- If you have not already done so, follow the Dataflow Quickstart, including installing gcloud and running gcloud init.
If you do not have Java on your local machine, the following setup instructions will allow you to launch Dataflow jobs using the Google Cloud Shell:
- If you have not already done so, follow the Genomics Quickstart.
- If you have not already done so, follow the Dataflow Quickstart.
- Use the Cloud Console to activate the Google Cloud Shell.
- Run the following commands in the Cloud Shell to install Java 8.
sudo apt-get update
sudo apt-get install --assume-yes openjdk-8-jdk maven
sudo update-alternatives --config java
sudo update-alternatives --config javac
Note
Depending on the pipeline, Cloud Shell may not have sufficient memory to run the pipeline locally (e.g., without the --runner command line flag). If you get the error java.lang.OutOfMemoryError: Java heap space, follow the instructions to run the pipeline using Compute Engine Dataflow workers instead of locally (e.g., use --runner=DataflowPipelineRunner).
If you want to run a small pipeline on your machine before running it in parallel on Compute Engine, you will need ALPN, since many of these pipelines require it. When running locally, ALPN must be provided on the boot classpath; on Compute Engine Dataflow workers it is already configured for you. You can download it from here. For example:
wget -O alpn-boot.jar \
http://central.maven.org/maven2/org/mortbay/jetty/alpn/alpn-boot/8.1.8.v20160420/alpn-boot-8.1.8.v20160420.jar
Download the latest GoogleGenomics dataflow runnable jar from the Maven Central Repository. For example:
wget -O google-genomics-dataflow-runnable.jar \
https://search.maven.org/remotecontent?filepath=com/google/cloud/genomics/google-genomics-dataflow/v1-0.1/google-genomics-dataflow-v1-0.1-runnable.jar
Create Output Dataset¶
In order to run this pipeline, you must have a Google Genomics dataset to which the pipeline can output its AnnotationSet and Annotations.
- If you already have a dataset to which you have write access, you may use it. Click here to see your datasets: https://console.cloud.google.com/project/_/genomics/datasets
- If not, you can click on the following link to use the Cloud Platform Console to create one: https://console.cloud.google.com/project/_/genomics/datasets/create.
In either case, the ID of the dataset is the output dataset id you should use when running the pipeline.
Run the pipeline¶
The following command will calculate the mean coverage as described above for a given genomic region on the Illumina Platinum Genomes dataset, using a bucket width of 1024 (which, for this 1,024-base region, yields a single output bucket):
java -Xbootclasspath/p:alpn-boot.jar \
-cp google-genomics-dataflow-runnable.jar \
com.google.cloud.genomics.dataflow.pipelines.CalculateCoverage \
--references=chr1:552960:553984 \
--bucketWidth=1024 \
--inputDatasetId=3049512673186936334 \
--outputDatasetId=YOUR-OUTPUT-DATASET-ID
This can take several minutes to run. You can check your results by using the Genomics API Explorer:
- First go to the AnnotationSets search request page to determine what your newly created AnnotationSetId is.
  - Put your output dataset id in the datasetIds field.
  - Press the Authorize and Execute button.
- Then go to the Annotations search request page to see your newly created Annotation.
  - Put the AnnotationSetId you just found in the annotationSetIds field.
  - Select info and position in the fields editor.
  - Press the Authorize and Execute button.
- Your Annotation should look like this:
{
"annotations": [
{
"position": {
"referenceId": "CNfS6aHAoved2AEQy9ao_KOKwa43",
"referenceName": "chr1",
"start": "552960",
"end": "553984"
},
"info": {
"A": [
"26.623047",
"28.424805",
"35.042969",
"35.083984",
"36.039063",
"39.678711",
"46.819336",
"52.219727",
"52.681641",
"56.575195",
"62.339844"
],
"H": [
"0.196289",
"0.196289",
"0.197266",
"0.393555",
"0.59082",
"0.59082",
"0.788086",
"0.956055",
"1.27832",
"1.345703",
"1.772461"
],
"L": [
"16.304688",
"17.844727",
"21.004883",
"23.180664",
"24.850586",
"24.894531",
"26.427734",
"29.884766",
"29.933594",
"32.101563",
"32.962891"
],
"M": [
"9.96875",
"10.036133",
"10.12207",
"10.383789",
"12.661133",
"13.644531",
"14.201172",
"22.845703",
"24.141602",
"25.765625",
"27.604492"
]
}
}
]
}
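Each array in info is sorted in ascending order; with the default of 11 quantiles, the values are the minimum, the 10th through 90th percentiles, and the maximum of the read group sets' mean coverages, so the middle element is the median. A small illustrative sketch (hypothetical class name) that extracts the median from the "A" array above:

public class MedianExample {
  public static void main(String[] args) {
    // The "A" (all mapping qualities) array from the annotation above.
    String[] allQualities = {
        "26.623047", "28.424805", "35.042969", "35.083984", "36.039063",
        "39.678711", "46.819336", "52.219727", "52.681641", "56.575195",
        "62.339844"
    };
    // With 11 quantiles: index 0 is the minimum, 10 the maximum, 5 the median.
    double median = Double.parseDouble(allQualities[allQualities.length / 2]);
    System.out.println("Median mean coverage (A): " + median); // 39.678711
  }
}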
The following command will also calculate the mean coverage in the same manner as the previous command, but will use a select number of read group sets from the Illumina Platinum Genomes dataset instead of the entire dataset, namely those for NA12883, NA12884, and NA12885. To do this, we must change the number of quantiles we are computing, as we now have fewer read group sets than the default requirement of 11:
java -Xbootclasspath/p:alpn-boot.jar \
-cp google-genomics-dataflow-runnable.jar \
com.google.cloud.genomics.dataflow.pipelines.CalculateCoverage \
--references=chr1:552960:553984 \
--bucketWidth=1024 \
--numQuantiles=3 \
--readGroupSetIds=CMvnhpKTFhCAv6TKo6Dglgg,CMvnhpKTFhDw8e3V6aCB-Q8,CMvnhpKTFhDo08GNkfe-jxo \
--outputDatasetId=YOUR_OUTPUT_DATASET_ID
This command should run a bit faster than the one above. You can check your results the same way as described above, except now your Annotation should look like this:
{
"annotations": [
{
"position": {
"referenceId": "CNfS6aHAoved2AEQy9ao_KOKwa43",
"referenceName": "chr1",
"start": "552960",
"end": "553984"
},
"info": {
"A": [
"35.042969",
"51.039063",
"56.575195"
],
"H": [
"0.393555",
"0.956055",
"1.345703"
],
"L": [
"21.004883",
"25.59375",
"31.087891"
],
"M": [
"13.644531",
"24.141602",
"24.489258"
]
}
}
]
}
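If you prefer to check the results from code rather than the API Explorer, the same AnnotationSets search can be issued against the Genomics v1 REST endpoint directly. A minimal, hypothetical sketch, assuming an OAUTH_TOKEN environment variable holds a valid access token (for example from gcloud auth print-access-token) and that YOUR-OUTPUT-DATASET-ID is replaced with your output dataset id:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SearchAnnotationSets {
  public static void main(String[] args) throws IOException {
    // Assumed to hold a valid access token, e.g. from
    // `gcloud auth print-access-token`.
    String token = System.getenv("OAUTH_TOKEN");
    URL url = new URL("https://genomics.googleapis.com/v1/annotationsets/search");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Authorization", "Bearer " + token);
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream os = conn.getOutputStream()) {
      // Same request body as the API Explorer step above.
      os.write("{\"datasetIds\": [\"YOUR-OUTPUT-DATASET-ID\"]}".getBytes("UTF-8"));
    }
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON containing your new AnnotationSet id
      }
    }
  }
}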
The pipeline commands above run locally over a small portion of the genome, taking only a few minutes. If modified to run over a larger portion of the genome or the entire genome, the pipeline may take a few hours, depending upon how many virtual machines are configured to run concurrently via --numWorkers. Add the following additional command line parameters to run the pipeline on Google Cloud instead of locally:
--runner=DataflowPipelineRunner \
--project=YOUR-GOOGLE-CLOUD-PLATFORM-PROJECT-ID \
--stagingLocation=gs://YOUR-BUCKET/dataflow-staging \
--numWorkers=#
Use a comma-separated list to run over multiple disjoint regions. For example, to run over BRCA1 and BRCA2, pass --references=chr13:32889610:32973808,chr17:41196311:41277499.
To run this pipeline over the entire genome, use --allReferences instead of --references=chr17:41196311:41277499.
To run the pipeline on a different dataset, change the --inputDatasetId parameter.
To run the pipeline on a different group of read group sets, change the --readGroupSetIds parameter.
To run the pipeline with a different bucket width, change the --bucketWidth parameter.
To run the pipeline with a different number of output quantiles, change the --numQuantiles parameter.
Additional details¶
If the Application Default Credentials are not sufficient, use --client-secrets PATH/TO/YOUR/client_secrets.json. If you do not already have this file, see the authentication instructions to obtain it.
Use --help to get more information about the command line options. Change the pipeline class name below to match the one you would like to run.
java -cp google-genomics-dataflow*runnable.jar \
com.google.cloud.genomics.dataflow.pipelines.VariantSimilarity --help
See the source code for implementation details: https://github.com/googlegenomics/dataflow-java