Run SAMtools to index BAM files in Cloud Storage¶
Suppose you have thousands of BAM files stored in Google Cloud Storage, and you need to create index files (BAI) for them.
The Grid Computing Tools GitHub repository provides code and instructions for running SAMtools many times in parallel to create those index files.
Overview¶
This tool takes advantage of two key technologies to process a large number of files:
Google Compute Engine provides virtual machines in the cloud. With sufficient quota in your Google Cloud project, you can start dozens or hundreds of instances concurrently. The more instances you add to your cluster, the more quickly you can process your files.
Grid Engine is used to distribute the file operation tasks across all of the instances, such that each instance takes responsibility for downloading a single file, running the operation, and uploading the result back to Cloud Storage.
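To make that division of labor concrete, here is a minimal sketch of what one task does. The input path, scratch directory, and output bucket are hypothetical placeholders; the repo's actual task scripts handle these steps for you:

```bash
#!/bin/bash
# Illustrative sketch of the per-task workflow (not the repo's actual script).
# BAM_PATH stands in for one GCS path taken from the input list.
BAM_PATH=gs://MY_BUCKET/PATH/sample.bam

SCRATCH=/scratch/task_workdir
mkdir -p ${SCRATCH}

# 1. Download the input BAM from Cloud Storage.
gsutil cp ${BAM_PATH} ${SCRATCH}/

# 2. Run the operation (here: samtools index) on the local copy.
BAM_FILE=${SCRATCH}/$(basename ${BAM_PATH})
samtools index ${BAM_FILE}

# 3. Upload the resulting .bai index back to Cloud Storage.
gsutil cp ${BAM_FILE}.bai gs://MY_BUCKET/output_path/
```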
Directory structure¶
To use the tool, you will need to download both the Grid Computing Tools GitHub repository and the Elasticluster repository to your local workstation or laptop.
No specific relationship exists between these two repositories, but in the following instructions it is assumed that the directories:

- `grid-computing-tools`
- `elasticluster`

are siblings under a workspace root (WS_ROOT) directory.
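For example, a workspace matching this layout could be created as follows. The `WS_ROOT` location is your choice, and the Elasticluster clone URL shown is an assumption; follow the Elasticluster setup instructions referenced below for the authoritative steps:

```bash
# Choose any directory as the workspace root.
export WS_ROOT=${HOME}/workspace
mkdir -p ${WS_ROOT}
cd ${WS_ROOT}

# Clone the two repositories as siblings under WS_ROOT.
git clone https://github.com/googlegenomics/grid-computing-tools.git
git clone https://github.com/elasticluster/elasticluster.git
```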
Running the samples¶
The quickest way to get familiar with the samtools tool is by trying the sample.
The sample provided here lists just 6 files to work on, and the instructions below demonstrate spreading the processing over 3 worker instances.
Create a cluster of Compute Engine instances running Grid Engine
In your current shell:
```
cd ${WS_ROOT}
```

Follow the instructions to Create a Grid Engine cluster on Compute Engine.
Download the `grid-computing-tools` repository (if you have not already done so):

```
cd ${WS_ROOT}
git clone https://github.com/googlegenomics/grid-computing-tools.git
```
Upload the `src` and `samples` directories to the Grid Engine master instance:

```
cd grid-computing-tools
elasticluster sftp gridengine << 'EOF'
mkdir src
mkdir src/common
mkdir src/samtools
put src/common/* src/common/
put src/samtools/* src/samtools/
mkdir samples
mkdir samples/samtools
put samples/samtools/* samples/samtools/
EOF
```
SSH to the master instance
```
elasticluster ssh gridengine
```
Set up the configuration files for the samples
The syntax for running the sample is:
```
./src/samtools/launch_samtools.sh [config_file]
```
The `config_file` lists two sets of key parameters:

- What operation to perform
- What are the source and destination locations

The operation to perform is controlled by the following:

- SAMTOOLS_OPERATION: `index`

The locations are determined by:

- INPUT_LIST_FILE: file containing a list of GCS paths to the input files to process
- OUTPUT_PATH: GCS path indicating where to upload the output files. If set to `source`, the output will be written to the same path as the source file (with the extension `.bai` appended)
- OUTPUT_LOG_PATH: (optional) GCS path indicating where to upload log files
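Putting these parameters together, a complete job config is just a small shell script of exports. The following is a hypothetical example modeled on the sample configs; the authoritative variable set is whatever the files in `samples/samtools/` define:

```bash
# Hypothetical job config -- modeled on the sample config files.
export SAMTOOLS_OPERATION=index

# List of GCS paths to the input BAM files, one per line.
export INPUT_LIST_FILE=./samples/samtools/samtools_index_file_list.txt

# Where to upload the generated .bai files
# (setting this to "source" writes them alongside the input BAMs instead).
export OUTPUT_PATH=gs://MY_BUCKET/output_path/samtools_index

# Optional: where to upload per-task log files.
export OUTPUT_LOG_PATH=gs://MY_BUCKET/log_path/samtools_index
```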
To use the samples, you must update the `OUTPUT_PATH` and `OUTPUT_LOG_PATH` to contain a valid GCS bucket name. The sample config file sets placeholders for the `OUTPUT_PATH` and `OUTPUT_LOG_PATH` such as:

```
export OUTPUT_PATH=gs://MY_BUCKET/output_path/samtools_index
export OUTPUT_LOG_PATH=gs://MY_BUCKET/log_path/samtools_index
```
You can do this manually with the editor of your choice, or you can change the config files with the command:

```
sed --in-place -e 's#MY_BUCKET#your_bucket#' samples/samtools/*_config.sh
```

where `your_bucket` should be replaced with the name of a GCS bucket in your Cloud project to which you have write access.

Run the sample
- Index a list of files using `samtools index` [ Estimated time to complete: 5 minutes ]

```
./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
```
When successfully launched, Grid Engine should emit a message such as:
```
Your job-array 1.1-6:1 ("samtools") has been submitted
```
This message tells you that the submitted job is a Grid Engine array job. The above message indicates that the job id is 1 and that the tasks are numbered 1 through 6. The name of the job, `samtools`, is also indicated.
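If array jobs are new to you: Grid Engine runs the same task script once per task, setting the `SGE_TASK_ID` environment variable to the task number. The usual pattern for mapping a task to one line of an input list looks like the sketch below; this illustrates the general mechanism, not necessarily the repo's exact code:

```bash
# General Grid Engine array-job pattern (illustrative).
# SGE_TASK_ID is set by Grid Engine to 1..N for an N-task array job.
INPUT_LIST_FILE=./samples/samtools/samtools_index_file_list.txt

# Pick out the line of the input list corresponding to this task.
BAM_PATH=$(sed -n "${SGE_TASK_ID}p" "${INPUT_LIST_FILE}")
echo "Task ${SGE_TASK_ID} will process: ${BAM_PATH}"
```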
Monitoring the status of your job
Grid Engine provides the `qstat` command to get the status of the execution queue.

While the job is in the queue, the state column will indicate the status of each task. Tasks not yet allocated to a compute node will be collapsed into a single row, as in the following output:

```
$ qstat
job-ID  prior   name    user    state submit/start at     queue  slots ja-task-ID
----------------------------------------------------------------------------------
      1 0.00000 my-job  janedoe qw    06/16/2015 18:03:32        1     1-6:1
```

The above output indicates that tasks 1-6 of job 1 are all in a `qw` (queue waiting) state.

When tasks get allocated, the output will look something like:

```
$ qstat
job-ID  prior   name    user    state submit/start at     queue            slots ja-task-ID
--------------------------------------------------------------------------------------------
      1 0.50000 my-job  janedoe r     06/16/2015 18:03:45 all.q@compute002 1     1
      1 0.50000 my-job  janedoe r     06/16/2015 18:03:45 all.q@compute001 1     2
      1 0.50000 my-job  janedoe r     06/16/2015 18:03:45 all.q@compute003 1     3
      1 0.00000 my-job  janedoe qw    06/16/2015 18:03:32                  1     4-6:1
```

which indicates tasks 1-3 are all in the `r` (running) state, while tasks 4-6 remain in a waiting state.

When all tasks have completed, `qstat` will produce no output.
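Because `qstat` goes quiet once everything is done, a small convenience loop (not part of the tool) can be used on the master instance to wait for completion:

```bash
# Poll qstat until the queue is empty, checking every 30 seconds.
while [ -n "$(qstat)" ]; do
  sleep 30
done
echo "All tasks have completed."
```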
Checking the logging output of tasks

Each Grid Engine task will write to an "output" file and an "error" file. These files will be located in the directory the job was launched from (the `HOME` directory). The files will be named, respectively:

- `job_name.o<job_id>.<task_id>` (for example: my-job.o1.10)
- `job_name.e<job_id>.<task_id>` (for example: my-job.e1.10)

The error file will contain any unexpected error output, and will also contain any download and upload logging from `gsutil`.
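For example, a quick hypothetical scan of the error files of job 1 (using the example job name above) might look like this. Since `gsutil` also writes normal progress output to the error files, review any matches by hand:

```bash
# List any task error files from job 1 that mention "error".
grep -li "error" ${HOME}/my-job.e1.* 2>/dev/null

# Inspect one task's error file directly.
tail -n 20 ${HOME}/my-job.e1.10
```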
Viewing the results of the jobs
When tasks complete, the result files are uploaded to GCS. You can view the list of output files with `gsutil ls`, such as:

```
gsutil ls OUTPUT_PATH
```

where `OUTPUT_PATH` should be the value you specified in the job config file.

Viewing log files
When tasks complete, the result log files are uploaded to GCS if `OUTPUT_LOG_PATH` was set in the job config file. The log files can be of value both to verify the success or failure of all tasks and to gather some performance statistics before starting a larger job.

Count the number of successful tasks:

```
gsutil cat OUTPUT_LOG_PATH/* | grep SUCCESS | wc -l
```

Count the number of failed tasks:

```
gsutil cat OUTPUT_LOG_PATH/* | grep FAILURE | wc -l
```

where `OUTPUT_LOG_PATH` should be the value you specified in the job config file.

Compute total task time
```
gsutil cat OUTPUT_LOG_PATH/* | \
  sed -n -e 's#^Task time.*: \([0-9]*\) seconds#\1#p' | \
  awk '{ sum += $1; } END { print sum " seconds" }'
```
Compute average task time
```
gsutil cat OUTPUT_LOG_PATH/* | \
  sed -n -e 's#^Task time.*: \([0-9]*\) seconds#\1#p' | \
  awk '{ sum += $1; } END { print sum/NR " seconds" }'
```
Destroying the cluster
When you are finished running the samples, disconnect from the master instance and then, from your workstation, shut down the gridengine cluster:
```
elasticluster stop gridengine
```
Running your own job¶
To run your own job to index a list of BAM files requires the following:

1. Create an input list file
2. Create a job config file
3. Create a gridengine cluster with sufficient disk space attached to each compute node
4. Upload the input list file, config file, and grid-computing-tools source to the gridengine cluster master
5. Do a "dry run" (optional)
6. Do a "test run" (optional)
7. Launch the job
The following instructions provide guidance on each of these steps.
It is recommended, though not a requirement, that you save your input list file and job config file to a directory outside the grid-computing-tools directory. For example, you might create a directory `${WS_ROOT}/my_jobs`.
Create an input list file

If all of your input files appear in a single directory, then the easiest way to generate a file list is with `gsutil`. For example:

```
gsutil ls gs://MY_BUCKET/PATH/*.bam > ${WS_ROOT}/my_jobs/bam_indexing_list_file.txt
```
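Since each line of the list becomes one Grid Engine task, it can be worth a quick sanity check of the list before launching (a hypothetical spot check):

```bash
# The line count equals the number of array-job tasks that will run.
wc -l ${WS_ROOT}/my_jobs/bam_indexing_list_file.txt

# Spot-check the first few entries for well-formed gs:// paths.
head -n 3 ${WS_ROOT}/my_jobs/bam_indexing_list_file.txt
```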
Create a job config file

The easiest way to create a job config file is to base it off the sample and update:

- INPUT_LIST_FILE
- OUTPUT_PATH
- OUTPUT_LOG_PATH

To have the generated BAM index file written to the same location as the source BAM, set:

```
OUTPUT_PATH=source
```
Save the job config file to `${WS_ROOT}/my_jobs/`.

Create a gridengine cluster with sufficient disk space attached to each compute node

Determine disk size requirements

Each compute node will require sufficient disk space to hold the input and output files for its current task. Determine the largest file in your input list and estimate the total space you will need. It may be necessary to download the file and perform the operation manually to get a maximum combined input and output size.

Persistent disk performance also scales with the size of the volume. Independent of storage requirements, for consistent throughput on long-running jobs, use a standard persistent disk of at least 1 TB, or use an SSD persistent disk. More documentation is available for selecting the right persistent disk.
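One way to estimate the largest input without downloading anything is to ask `gsutil` for object sizes. The sketch below uses the hypothetical list file from step 1 and prints the largest input in bytes:

```bash
# gsutil du prints "<bytes> <gs://path>" for each object; sort
# numerically and keep the last line to find the largest input.
cat ${WS_ROOT}/my_jobs/bam_indexing_list_file.txt | \
  xargs gsutil du | \
  sort -n | \
  tail -n 1
```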
Verify or increase quota
Your choice of the number of nodes and disk size must take into account your Compute Engine resource quota for the region of your cluster.

Quota limits and current usage can be viewed in the Cloud Platform Console, or with `gcloud compute`, where REGION is the region of your cluster:

```
gcloud compute regions describe REGION
```

Important quota limits include CPUs, in-use IP addresses, and disk size. To request additional quota, submit the Compute Engine quota request form.
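The `describe` output lists each quota as a metric/limit/usage triple, so a quick filter can pull out a single quota of interest. The region name below is illustrative, and the exact output format may vary by `gcloud` version:

```bash
# Show the limit, metric, and usage lines for the CPUS quota.
gcloud compute regions describe us-central1 | grep -B 1 -A 1 "metric: CPUS"
```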
Configure your cluster
Instructions for setting the boot disk size for the compute nodes of your cluster can be found at Setting the boot disk size.
You will likely want to set the number of compute nodes for your cluster to a number higher than the 3 specified in the example cluster setup instructions (see the hypothetical config excerpt below).

Once configured, start your cluster.
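For reference, these settings live in your Elasticluster configuration file (typically `~/.elasticluster/config`). The excerpt below is hypothetical; the exact section and key names depend on your Elasticluster version and setup, so treat this as a sketch and consult the Elasticluster documentation:

```ini
# Hypothetical ~/.elasticluster/config excerpt (verify key names
# against your Elasticluster version's documentation).
[cluster/gridengine]
# More worker nodes than the 3 used in the sample instructions.
compute_nodes=10
# Larger boot disk (in GB) per node, per "Setting the boot disk size".
boot_disk_size=1000
```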
Upload the input list file, config file, and `grid-computing-tools` source to the gridengine cluster master:

```
elasticluster sftp gridengine << EOF
put ../my_jobs/*
mkdir src
mkdir src/common
mkdir src/samtools
put src/common/* src/common/
put src/samtools/* src/samtools/
EOF
```
Do a “dry run” (optional)
The tool supports the `DRYRUN` environment variable. Setting this value to 1 when launching your job will cause the queued job to execute without downloading or uploading any files. The local output files, however, will be populated with useful information about what files would be copied. This can be useful for ensuring your file list is valid and that the output path is correct.

For example:

```
$ DRYRUN=1 ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
Your job-array 2.1-6:1 ("samtools") has been submitted
```
Then, after waiting for the job to complete, inspect the output files:

```
$ head -n 5 samtools.o3.1
Task host: compute002
Task start: 1
Input list file: ./samples/samtools/samtools_index_file_list.txt
Output path: gs://cookbook-bucket/output_path/samtools_index
Output log path: gs://cookbook-bucket/log_path/samtools_index

$ grep "^Will download:" samtools.o3.*
samtools.o3.1:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom9.SOLID.bfast.CEU.high_coverage.20100125.bam to /scratch/samtools.3.1/in/
samtools.o3.2:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom1.LS454.ssaha2.CEU.high_coverage.20100311.bam to /scratch/samtools.3.2/in/
samtools.o3.3:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chrom11.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.3/in/
samtools.o3.4:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chrom12.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.4/in/
samtools.o3.5:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chrom10.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.5/in/
samtools.o3.6:Will download: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/data/NA12878/alignment/NA12878.chromX.SOLID.corona.SRP000032.2009_08.bam to /scratch/samtools.3.6/in/

$ grep "^Will upload:" samtools.o3.*
samtools.o3.1:Will upload: /scratch/samtools.3.1/in/NA12878.chrom9.SOLID.bfast.CEU.high_coverage.20100125.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
samtools.o3.2:Will upload: /scratch/samtools.3.2/in/NA12878.chrom1.LS454.ssaha2.CEU.high_coverage.20100311.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
samtools.o3.3:Will upload: /scratch/samtools.3.3/in/NA12878.chrom11.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
samtools.o3.4:Will upload: /scratch/samtools.3.4/in/NA12878.chrom12.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
samtools.o3.5:Will upload: /scratch/samtools.3.5/in/NA12878.chrom10.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
samtools.o3.6:Will upload: /scratch/samtools.3.6/in/NA12878.chromX.SOLID.corona.SRP000032.2009_08.bam.bai to gs://cookbook-bucket/output_path/samtools_index/
```
Do a “test run” (optional)
The tool supports environment variables to indicate which lines in the input list to run:
- LAUNCH_MIN: lowest line number to process
- LAUNCH_MAX: highest line number to process

For example, to launch a Grid Engine array job that only processes line 1:

```
$ LAUNCH_MIN=1 LAUNCH_MAX=1 ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
Your job-array 5.1-1:1 ("samtools") has been submitted
```
The `LAUNCH_MIN` and `LAUNCH_MAX` values can be used with the `DRYRUN` environment variable:

```
$ DRYRUN=1 LAUNCH_MIN=1 LAUNCH_MAX=5 ./src/samtools/launch_samtools.sh ./samples/samtools/samtools_index_config.sh
Your job-array 6.1-5:1 ("samtools") has been submitted
```
Launch the job
On the master instance, run the launch script, passing in the config file:
```
./src/samtools/launch_samtools.sh my_job_config.sh
```

where `my_job_config.sh` is replaced by the name of your config file created in step 2.